I've recently started learning Cassandra, and while going through the DataStax docs on NodeSync (Nodesync-Doc) I came across this statement where it describes the token range of a segment:
The token ranges can be no smaller than a single partition, so very large partitions can result in segments larger than the configured size.
The way I understood partitioning in Cassandra is that a single token denotes a single partition. There could be multiple rows for that particular token, but there wouldn't be multiple tokens per Cassandra partition. Did I get it wrong, and if not, could someone please explain what is meant by token ranges per partition?
Yes, one partition == one token. I think the confusion comes from the segment term used by NodeSync - one NodeSync segment may include multiple partitions, but it can't be smaller than a partition.
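You can see that mapping yourself by asking Cassandra for the token it computes from the partition key. A minimal sketch, assuming a hypothetical table ks.users with partition key user_id:
select user_id, token(user_id) from ks.users limit 10;
-- every row that belongs to the same partition (same user_id)
-- comes back with exactly the same token value
A NodeSync segment then covers a contiguous range of such token values, so it can span many partitions but never cuts through a single one.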
I am new to Cassandra. I have a Cassandra cluster with 6 nodes and I am trying to find the partition size.
I tried to fetch it with this basic command:
nodetool tablehistograms keyspace.tablename
Now I am wondering how it is calculated, and why the result has only 5 records other than min and max while the number of nodes is 6. Do the number of nodes and the number of partitions for a table have any relation?
Fundamentally, what I know is that the partition key is used to hash and distribute the data to be persisted across the various nodes.
When exactly should we go for bucketing? I am assuming that Cassandra has a partitioner that takes care of distributed persistence across nodes.
The number of entries in this column is not related to the number of nodes. It shows the distribution of the values - you have min, max, and percentiles (50/75/95/98/99).
Most of the nodetool commands don't show anything about other nodes - they are tools for providing information about the current node only.
P.S. This document would be useful in explaining how to interpret this information.
As the name of the command suggests, tablehistograms reports the distribution of metadata for the partitions held by a node.
To add to what Alex Ott has already stated, the percentiles (not percentages) provide insight into the range of metadata values. For example:
50% of the partitions for the given table have a size of 74KB or less
95% are 263KB or less
98% are 455KB or less
These metadata don't have any correlation with the number of partitions or the number of nodes in your cluster.
You are correct in that the partition key gets hashed and the resulting value determines where the partition (and its associated rows) gets stored (distributed among the nodes in the cluster). If you're interested, I've explained it in a bit more detail with some examples in this post -- https://community.datastax.com/questions/5944/.
As far as bucketing is concerned, you would typically do that to reduce the number of rows in a partition and therefore reduce its size. The general recommendation is to keep your partition sizes under 100MB for optimal performance, but it's not a hard rule -- you can have larger partitions as long as you are aware of the trade-offs.
In your case, the largest partition is only 455KB, so size is not a concern. Cheers!
Although this has been asked and answered many times, I did not find a good answer anywhere -
neither in forums nor in the Cassandra docs.
How do virtual nodes work?
Suppose a node has 256 virtual nodes,
and the docs say they are distributed randomly
(setting aside how that "randomly" is done... I have another, more urgent question):
Is it right that every Cassandra node ("physical") is actually responsible for several distinct locations in the ring (256 locations)? Does that mean the "physical" node is sort of "spread" over the whole circle?
How does rebalancing work in that case, if I add a new node?
The ring will get an additional 256 virtual nodes.
How will those additional vnodes divide the data with the old nodes?
Will they, basically, appear as additional "bicycle spokes" randomly spread through the whole ring?
There is a lot of info on the internet, but nobody gives a clear explanation...
Vnodes break up the available range of tokens into smaller ranges, defined by the num_tokens setting in the cassandra.yaml file. The vnode ranges are randomly distributed across the cluster and are generally non-contiguous. If we use a large number for num_tokens to break up the token ranges, the random distribution means it is less likely that we will have hot spots. Using statistical computation, the point where clusters of any size always had a good token range balance was when 256 vnodes were used. Hence, the num_tokens default value of 256 was recommended by the community to prevent hot spots in a cluster.
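If you want to see those vnode tokens for yourself, you can query the system tables from cqlsh. A minimal sketch (assuming vnodes are enabled with num_tokens: 256; the tokens column exists in system.local and system.peers):
-- the tokens owned by the node you are connected to (one per vnode)
select tokens from system.local;
-- the tokens owned by every other node in the cluster
select peer, tokens from system.peers;
Each physical node shows up with one entry per vnode in that set, which is exactly the "spread over the whole circle" effect asked about above.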
Ans 1: It is a range of tokens based on num_tokens. If you have set 256 (the default), you will get 256 token ranges.
Ans 2: Yes, when you add or remove nodes, the tokens are redistributed across the cluster based on the vnode configuration.
You may refer to https://docs.datastax.com/en/ddac/doc/datastax_enterprise/dbArch/archDataDistributeVnodesUsing.html for more details.
LetsNoSQL's answer is correct. See also https://stackoverflow.com/a/37982696/5209009. I'll only add a few more comments:
Yes, the "physical" node is spread on the token range.
As explained in the link, any new node will take 256 new token ranges, dividing some of the existing ones. There is no other rebalancing, it relies on randomness to achieve some rebalancing, that's why it's using a relatively large (256) number of tokens per node.
It's worth mentioning that there is another option. You can run vnodes with a smaller number of tokens per node (4-8) with a token allocation algorithm. Any new tokens will not be allocated randomly, a greedy algorithm will be used so that the new tokens will create a distribution that optimises the load on a given keyspace. It will simply divide in half the token ranges containing most of the data. Since it's not random it can work with a smaller number of tokens (4-8). It's not really relevant for small clusters, but for 100+ nodes it can be.
See https://www.datastax.com/blog/2016/01/new-token-allocation-algorithm-cassandra-30 and https://thelastpickle.com/blog/2019/02/21/set-up-a-cluster-with-even-token-distribution.html.
If the only thing I have available is a com.datastax.driver.core.Session, is there a way to get a rough estimate of row count in a Cassandra table from a remote server? Performing a count is too expensive. I understand I can get a partition count estimate through JMX but I'd rather not assume JMX has been configured. (I think that result must be multiplied by number of nodes and divided by replication factor.) Ideally the estimate would include cluster keys too, but everything is on the table.
I also see there's a size_estimates table in the system keyspace but I don't see much documentation on it. Is it periodically refreshed or do the admins need to run something like nodetool flush?
Aside from not including cluster keys, what's wrong with using this as a very rough estimate?
select sum(partitions_count)
from system.size_estimates
where keyspace_name='keyspace' and table_name='table';
The size estimates are updated on a timer every 5 minutes (overridable with -Dcassandra.size_recorder_interval).
This is a very rough estimate, but from the token of the partition key you could find the range it belongs in, pull from this table on each of the replicas (it is local to each node and unique to it, not global), and divide the size by the number of partitions for a very vague approximation of the partition size. There are so many assumptions and so much averaging on this path, even before anything is written to this table. Cassandra errs on the side of efficiency at the cost of accuracy here and it's geared more towards general uses like Spark bulk reading, so take it with a grain of salt.
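If you want to look at the raw per-range rows behind that sum, something like the sketch below works (the keyspace and table names are placeholders):
-- one row per token range held locally; mean_partition_size is in bytes
select range_start, range_end, mean_partition_size, partitions_count
from system.size_estimates
where keyspace_name = 'keyspace' and table_name = 'table';
Keep in mind these rows only cover the ranges replicated to the node you are connected to, so a cluster-wide figure still means summing across nodes and dividing by the replication factor, as noted in the question.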
It's not useful to you now, but looking towards the future, after the 4.0 freeze there will be many new virtual tables, possibly including ones to get accurate statistics on specific partitions or ranges of partitions on demand.
I am trying to store the following structure in Cassandra:
ShopID, UserID, FirstName, LastName etc....
Most of the queries on it are:
select * from table where ShopID = ? and UserID = ?
That's why it is useful to set (ShopID, UserID) as the primary key.
According to the docs, the default partition key chosen by Cassandra is the first column of the primary key - in my case that's ShopID. But I want to distribute the data uniformly across the Cassandra cluster; I cannot allow all the data for one ShopID to be stored in only one partition, because some shops have 10M records and some only 1k.
I can set up (ShopID, UserID) as the partition key; then I get a uniform distribution of records across the Cassandra cluster. But after that I cannot retrieve all the users that belong to a given ShopID:
select *
from table
where ShopID = ?
It's obvious that this query demands a full scan of the whole cluster, but I have no way to do that. And it looks like a very hard constraint.
My question is how to reorganize the data to solve both problems (uniform data partitioning and the possibility to run such queries) at the same time.
In general, you need to make the user id a clustering column and add some artificial information to your table and partition key when saving. This allows you to break a large natural partition into multiple synthetic ones. But now you need to query all the synthetic partitions on read to combine them back into the natural partition. So the goal is to find a reasonable trade-off between the number (size) of synthetic partitions and the read queries needed to combine all of them.
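As a minimal sketch of that idea for the ShopID/UserID table above (the column types, the names and the bucket count of 16 are assumptions for illustration, not a recommendation):
create table shop_users (
    shop_id int,
    bucket int,        -- synthetic part of the key, e.g. hash(user_id) % 16, computed by the app on write
    user_id int,
    first_name text,
    last_name text,
    primary key ((shop_id, bucket), user_id)
);
-- the point lookup still hits a single partition, because the app can recompute the bucket:
select * from shop_users where shop_id = ? and bucket = ? and user_id = ?;
-- "all users of a shop" becomes 16 small queries (one per bucket) that can run in parallel:
select * from shop_users where shop_id = ? and bucket = 0;
-- ... and so on for buckets 1 to 15
With 16 buckets, a 10M-record shop is split into partitions of roughly 600k rows each, while the 1k-record shops just produce 16 tiny partitions.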
A comprehensive description of possible implementations can be found here and here (Example 2: User Groups).
Also take a look at the solution (Example 3: User Groups by Join Date) where querying/ordering/grouping is performed by a clustering column of date type. It can be useful if you have similar queries.
Each node in Cassandra is responsible for some token ranges. Cassandra derives a token from a row's partition key using hashing and sends the record to the node whose token range includes this token. Different records can have the same token, and they are grouped into partitions. For simplicity we can assume that each Cassandra node stores the same number of partitions. We also want partitions to be equal in size so that they are distributed uniformly between nodes. If we have a partition that is too huge, it means that one of our nodes needs more resources to process it. But if we break it into multiple smaller ones, we increase the chance that they will be evenly distributed across all nodes.
However, the distribution of token ranges between nodes is not related to the distribution of records between partitions. When we add a new node, it just assumes responsibility for an even portion of the token ranges from the other nodes and, as a result, an even number of partitions. If we had 2 nodes with 3 GB of data each, after adding a third node each node stores 2 GB of data. That's why scalability isn't affected by partitioning and you don't need to change your historical data after adding a new node.
We have a table that stores our data partitioned by files. One file is 200MB to 8GB as JSON - but there's a lot of overhead, obviously. Compacting the raw data will lower this drastically. I ingested about 35 GB of JSON data and only one node got slightly more than 800 MB of data. This is possibly due to "write hotspots" -- but we write only once and otherwise only read. We do not update data. Currently, we have one partition per file.
By using secondary indexes, we search for partitions in the database that contain a specific geolocation (= first query) and then take the result of this query to range-query a time range of the found partitions (= second query). This might even be the whole file if needed, but in 95% of the queries only chunks of a partition are queried.
We have a replication factor of 2 on a 6-node cluster. Data is fairly evenly distributed; every node owns 31.9% to 35.7% (effective) of the data according to nodetool status tablename.
Good read performance is key for us.
My questions:
How big is too big for a partition in terms of volume or row size? Is there a rule of thumb for this?
For range query performance: is it better to split up our "big" partitions into more, smaller partitions? We built our schema with "big" partitions because we thought that when we do range queries on a partition, it would be good to have it all on one node so the data can be fetched easily. Note that the data is also available on one other replica due to RF 2.
C* supports very large partitions, but that doesn't mean it is a good idea to go to that level. The right limit depends on the specific use case, but a good ballpark value could be between 10k and 50k rows per partition. Of course, everything is a compromise, so if you have "huge" (in terms of bytes) rows then heavily limit the number of rows in each partition; if you have "small" (in terms of bytes) rows then you can relax that limit a bit. This is because a partition lives only on its replica nodes (just two with your RF of 2), so all your queries for a specific partition will hit only those nodes.
Range queries should ideally go to one partition only. A range query means a sequential scan of your partition on the node serving the query, but then you limit yourself to the throughput of that node. If you split your range queries across more nodes (that is, you change the way you partition your data by adding something like a bucket), you can get data from different nodes as well by performing parallel queries, directly increasing the total throughput. Of course you'd lose the ordering of your records across different buckets, so if the order within your partition matters, that might not be feasible.
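To make that concrete, here is a hedged sketch of what a bucketed version of such a table could look like (the column names, the payload column and the day-sized bucket are assumptions for illustration, not your actual model):
create table file_chunks (
    file_id text,
    day date,          -- the bucket: one partition per file and day
    ts timestamp,
    payload blob,
    primary key ((file_id, day), ts)
);
-- a time-range query now touches one partition per day in the requested range,
-- and the application can issue the per-day queries in parallel:
select * from file_chunks
where file_id = ? and day = ? and ts >= ? and ts < ?;
Whether this beats one big partition per file depends on how many buckets a typical query spans: within a day the rows stay ordered by ts, but across days the application has to merge the results.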