Folks,
We are evaluating Cassandra for one of our production applications. We have a few basic questions which we would like to understand before going forward.
WRITE :
Cassandra uses a consistent hashing mechanism to distribute keys evenly across nodes, so each key ends up on a particular Cassandra node.
We further understood that internal SSTable structures are created to store this data within the node.
READ :
While performing a read, the client sends a request to any node in the Cassandra cluster, and based on consistent hashing Cassandra determines which node the key is located on.
The following things are not clear:
1) How many SSTables are created for a given keyspace/column family on a node (is it some fixed number, or only one)?
2) The Cassandra documentation describes a bloom filter (a probabilistic alternative to standard hashing) which is used to determine whether a given key is present in an SSTable or not. (What if there are 1000 SSTables? Will there be 1000 bloom filters to check to determine whether the key is present or not?)
1) The number of SSTables depends on the compaction strategy and load. To get an idea, check out log-structured merge trees for a basic understanding, then look at the different compaction strategies (size tiered, leveled, date tiered).
2) Yes, there is one bloom filter per SSTable to give a probabilistic answer on whether a partition exists in that SSTable. The size of the bloom filter depends on the number of partitions and the target false-positive percentage. They are kept off heap and are generally small, so they are less of a concern nowadays than in earlier versions.
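The false-positive target is a per-table setting you can tune in CQL. A minimal sketch, assuming a hypothetical keyspace ks and table tbl:

ALTER TABLE ks.tbl WITH bloom_filter_fp_chance = 0.01;  -- lower chance = larger filter, fewer wasted SSTable reads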
Checking out the Dynamo and Bigtable papers may help in understanding the principles behind the clustering and storage. There are a lot of free resources on the read/write path - too much to fully go over in a Stack Overflow question - so I would recommend going through some material at the DataStax Academy or some presentations on YouTube.
Related
In the Dynamo paper, the author introduced 3 different partitioning strategies.
It seems DynamoDB has evolved from strategy 1 to strategy 3. I have a few questions related to strategy 3:
Refer to:
Since partition ranges are fixed, they can be stored in separate files, meaning a partition can be relocated as a unit by simply transferring the file (avoiding random accesses needed to locate specific items). This simplifies the process of bootstrapping and recovery.
How is this managed at a low level? One node can have a few partitions assigned to it. Is each partition handled separately inside the storage engine? For example, does each partition have a separate set of (memtable + SSTables), and do they compact at their own pace? This seems to introduce complexity into the system and make it hard to debug if the compaction processes go wild.
It seems the partitioning granularity is fixed beforehand. Is there any way to partition further after the initial stage? For example, if a-c is one partition and later on prefix b becomes hot and a noisy neighbor to prefixes a and c, is there a way to isolate b onto another node? How do we handle this situation in DynamoDB?
Does Cassandra use strategy 1 or strategy 3? From what I can tell from the num_tokens and initial_token settings in cassandra.yaml, I believe it's strategy 1. Am I wrong?
Trying to answer each question in turn:
One node can have a few partitions assigned to it.
Each node will have 1 or more token ranges assigned during bootstrapping. Depending on the partitioner this is a numeric range: -2^63 to 2^63-1 for the Murmur3Partitioner, or 0 to 2^127-1 for the RandomPartitioner.
Each token here can contain a partition (but might not), so while you are thinking of it as the node owning partitions, strictly speaking it owns token ranges.
Is each partition handled separately inside the storage engine?
This question doesn't really follow - an SSTable can contain 1 or more partitions, and a partition can be contained in 1 or more SSTables - i.e. a partition can span SSTables.
For example, does each partition have a separate set of (memtable + SSTables), and they compact at their own paces?
No, there will be one memtable per database table, and these are flushed to create the SSTables. The compaction of the multiple SSTables is determined by the compaction strategy setting, with quite different behaviours and advantages / disadvantages to each depending on the usage scenario: one size does not fit all. Again, each SSTable can contain multiple partitions, and a partition can appear in more than one SSTable.
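The compaction strategy is a per-table CQL setting. A minimal sketch, assuming a hypothetical keyspace ks and table tbl, and choosing leveled compaction purely as an example:

ALTER TABLE ks.tbl
WITH compaction = {'class': 'LeveledCompactionStrategy', 'sstable_size_in_mb': 160};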
This seems to introduce complexity into the system and make it hard to debug if the compaction processes go wild.
Compaction itself is not a trivial topic, but since the initial premise is not correct, it does not introduce this complexity.
It seems the partitioning granularity is fixed beforehand. Is there any way to partition further after the initial stage?
Writing specifically about Cassandra - every time you add or remove a node the token ranges that belong to each node can and will alter. So it is not entirely 'static', but it is not easy to change or manipulate either.
For example, if a-c is one partition and later on prefix b becomes hot and a noisy neighbor to prefixes a and c, is there a way to isolate b onto another node?
Again, specific to Cassandra: in theory yes - you calculate the hash value of the partition key and use initial_token values on a node to give it a very narrow range. In practice no - this is a data model design issue, because the data has been partitioned in a way that creates a hot spot.
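If you want to see which token a particular partition key hashes to, you can ask Cassandra directly. A minimal sketch, assuming a hypothetical table ks.tbl with a text partition key pk:

SELECT pk, token(pk) FROM ks.tbl WHERE pk = 'b-example';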
Does Cassandra use strategy 1 or strategy 3? From what I can tell from the num_tokens and initial_token settings in cassandra.yaml, I believe it's strategy 1. Am I wrong?
Using num_tokens, which creates vnodes, is in effect dividing the consistent hash ring up more times: 10 nodes with num_tokens = 16 means the overall token range is divided into 160 slices, with each node owning 16 of them as its token ranges. The nodes will of course also hold replicas of other nodes' ranges, based on the replication factor and rack assignments. If you only had RF=1, then each node would only store data for the range(s) it is assigned.
initial_token is the setting that controls the initial token value(s) when the node is bootstrapped - you can choose to calculate and set it manually, or you can let the partitioner calculate it for you. Further changes to that setting after bootstrap will have no impact.
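You can inspect the token ranges a node ended up with by querying the system tables (these are standard columns, no assumptions about your schema):

SELECT tokens FROM system.local;        -- tokens owned by the node you are connected to
SELECT peer, tokens FROM system.peers;  -- tokens owned by every other node in the cluster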
I know that secondary indexes in Cassandra are generally a bad idea because the index is stored locally on each node, i.e. not distributed across the cluster, which may result in a query scanning a huge number of nodes. However, I don't understand why they are still a bad idea if I always specify the partition key in my queries and only use the secondary index as a final filter. I've read that they don't scale with large amounts of data even if I specify the partition key. Is this true? And if so, why?
In general secondary indexes are a bad idea, not only because of the distributed part, but also because of the index size and the number of distinct values: if you index a field with very high or very low cardinality, you will spend time scanning many rows or many columns.
You can also run into other issues when dealing with tombstones...
To answer your question: secondary indexes in Cassandra don't scale that well, but if you also provide the partition key, thereby telling Cassandra which node has the data, they perform much better!
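For example, a minimal sketch with a hypothetical ks.orders table whose partition key is userid and which has a secondary index on status - the query is restricted to a single partition, so only that partition's replicas are involved:

CREATE INDEX orders_status_idx ON ks.orders (status);

SELECT * FROM ks.orders WHERE userid = 42 AND status = 'shipped';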
You can find more details here, in section F:
https://www.datastax.com/blog/2016/04/cassandra-native-secondary-index-deep-dive
I hope this helps !
These guys have a nice write-up on the performance impacts of secondary indexes:
https://pantheon.io/blog/cassandra-scale-problem-secondary-indexes
The main impact (from the post) is that secondary indexes are local to each node, so to satisfy a query by indexed value, each node has to query its own records to build the final result set (as opposed to a primary key query, where it is known exactly which node needs to be queried). So there's not just an impact on writes, but on read performance as well.
Imagine Cassandra on a ring of five machines, with a primary index of user IDs and a secondary index of user emails. If you were to query for a user by their ID - their primary indexed key - any machine in the ring would know which machine has the record of that user. One query, one read from disk. However, to query a user by their email - their secondary indexed value - each machine has to query its own records of users. One query, five reads from disk. By either scaling the number of users system-wide, or by scaling the number of machines in the ring, the noise-to-signal ratio increases and the overall efficiency of reading drops, in some cases to the point of timing out.
Please refer to the link below for a good explanation of secondary indexes:
https://dzone.com/articles/cassandra-scale-problem
If the only thing I have available is a com.datastax.driver.core.Session, is there a way to get a rough estimate of row count in a Cassandra table from a remote server? Performing a count is too expensive. I understand I can get a partition count estimate through JMX but I'd rather not assume JMX has been configured. (I think that result must be multiplied by number of nodes and divided by replication factor.) Ideally the estimate would include cluster keys too, but everything is on the table.
I also see there's a size_estimates table in the system keyspace but I don't see much documentation on it. Is it periodically refreshed or do the admins need to run something like nodetool flush?
Aside from not including cluster keys, what's wrong with using this as a very rough estimate?
select sum(partitions_count)
from system.size_estimates
where keyspace_name='keyspace' and table_name='table';
The size estimates table is updated on a timer every 5 minutes (overridable with -Dcassandra.size_recorder_interval).
This is a very rough estimate. From the token of the partition key you can find the range it belongs in, pull this table from each of the replicas (it is local to each node and not globally replicated), and divide the size by the number of partitions for a very vague, approximate estimate of the partition size. There are many assumptions and a lot of averaging on this path even before anything is written to the table. Cassandra errs on the side of efficiency at the cost of accuracy here, and the table is intended more for general uses like Spark bulk reading, so take it with a grain of salt.
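For instance, a sketch of pulling the raw per-range numbers (keeping the placeholder keyspace/table names from the query above) that you would then average and scale yourself:

select range_start, range_end, partitions_count, mean_partition_size
from system.size_estimates
where keyspace_name='keyspace' and table_name='table';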
It's not useful now, but looking towards the future, after the 4.0 freeze there will be many new virtual tables, possibly including ones that provide accurate statistics on specific partitions or ranges of partitions on demand.
According to most articles on the internet, Random Partitioning (RP) is better than Ordered Partitioning (OP) because of data distribution.
In fact, I think that because of data replication, even if we use OP the data will be well distributed! So is the first assumption still true?
What about read performance? Is OP better than RP when trying to read data between two values in the same range?
Thanks a lot.
I can't really answer confidently for HBase (which only supports ordered partitioning, to my knowledge), but for Cassandra I would strongly discourage the use of OrderPreservingPartitioner and ByteOrderedPartitioner unless you have a very specific use case that requires it (like needing to do range scans across keys). It is not very common for an ordered partitioner to be used.
In fact, I think that because of data replication, even if we use OP the data will be well distributed! So is the first assumption still true?
Not particularly, it is much more likely for hotspots to be encountered with an Ordered Partitioner vs. a Random Partitioner. As described from the Partitioners page on the Cassandra Wiki:
globally ordering all your partitions generates hot spots: some partitions close together will get more activity than others, and the node hosting those will be overloaded relative to others. You can try to mitigate with active load balancing but this works poorly in practice; by the time you can adjust token assignments so that less hot partitions are on the overloaded node, your workload often changes enough that the hot spot is now elsewhere. Remember that preserving global order means you can't just pick and choose hot partitions to relocate, you have to relocate contiguous ranges.
There are other problems with Ordered Partitioning that are described well here:
Difficult load balancing:
More administrative overhead is required to load balance the cluster. An ordered partitioner requires administrators to manually calculate partition ranges based on their estimates of the partition key distribution. In practice, this requires actively moving node tokens around to accommodate the actual distribution of data once it is loaded.
Uneven load balancing for multiple tables:
If your application has multiple tables, chances are that those tables have different row keys and different distributions of data. An ordered partitioner that is balanced for one table may cause hot spots and uneven distribution for another table in the same cluster.
With regards to:
What about read performance? Is OP better than RP when trying to read data between two values in the same range?
You will definitely achieve better performance for range scans (i.e. get all data between this key and that key).
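For example, a key-range query like the following sketch (hypothetical table ks.users with a text partition key username) actually returns a contiguous range of keys under an ordered partitioner, because token order matches key order; under Murmur3 or RandomPartitioner the same predicate just selects an arbitrary hash range:

SELECT * FROM ks.users
WHERE token(username) >= token('a') AND token(username) < token('d');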
So it really comes down to the kind of queries you are making. Are range scan queries between keys vital to you? In that case HBase may be a more appropriate solution for you. If it is not as important, there are reasons to consider C* instead. I won't add much more to that as I don't want my answer to devolve into comparing the two solutions :).
I am using Cassandra 1.2.15 with the ByteOrderedPartitioner in a cluster environment of 4 nodes with 2 replicas. I want to know what the drawbacks of using the above partitioner in a cluster environment are. After a long search I found one drawback, and I need to know the consequences of that drawback:
1) Data will not be distributed evenly.
What type of problems will occur if data is not distributed evenly?
Is there any other drawback with the above partitioner in a cluster environment? If so, what are the consequences of those drawbacks? Please explain clearly.
One more question: suppose I go with the Murmur3Partitioner; the data will be distributed evenly, but the order will not be preserved. However, this drawback can be overcome with clustering order (the second key in the primary key). Is my understanding correct?
As you are using Cassandra 1.2.15, I have found a doc pertaining to Cassandra 1.2 which illustrates the points behind why using the ByteOrderedPartitioner (BOP) is a bad idea:
http://www.datastax.com/documentation/cassandra/1.2/cassandra/architecture/architecturePartitionerBOP_c.html
Difficult load balancing: More administrative overhead is required to load balance the cluster. An ordered partitioner requires administrators to manually calculate partition ranges (formerly token ranges) based on their estimates of the row key distribution. In practice, this requires actively moving node tokens around to accommodate the actual distribution of data once it is loaded.
Sequential writes can cause hot spots: If your application tends to write or update a sequential block of rows at a time, then the writes are not distributed across the cluster; they all go to one node. This is frequently a problem for applications dealing with timestamped data.
Uneven load balancing for multiple tables: If your application has multiple tables, chances are that those tables have different row keys and different distributions of data. An ordered partitioner that is balanced for one table may cause hot spots and uneven distribution for another table in the same cluster.
For these reasons, the BOP has been identified as a Cassandra anti-pattern. Matt Dennis covers it in his SlideShare presentation on Cassandra anti-patterns.
So seriously, do not use the BOP.
"however this drawback can be overcome with cluster ordering (Second key in the primary keys). Whether my understanding is correct?"
Somewhat, yes. In Cassandra you can dictate the order of your rows (within a partition key) by using a clustering key. If you wanted to keep track of (for example) station-based weather data, your table definition might look something like this:
CREATE TABLE stationreads (
    stationid uuid,
    readingdatetime timestamp,
    temperature double,
    windspeed double,
    PRIMARY KEY ((stationid), readingdatetime));
With this structure, you could query all of the readings for a particular weather station, and order them by readingdatetime. However, if you queried all of the data (ex: SELECT * FROM stationreads;) the results probably will not be in any discernible order. That's because the total result set will be ordered by the (random) hashed values of the partition key (stationid in this case). So while "yes" you can order your results in Cassandra, you can only do so within the context of a particular partition key.
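For instance, a sketch of a per-station query against the table above that does come back ordered by the clustering key (the uuid literal is just a placeholder):

SELECT readingdatetime, temperature, windspeed
FROM stationreads
WHERE stationid = 9f0c5a52-0000-0000-0000-000000000000
ORDER BY readingdatetime DESC;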
Also, there have been many improvements in Cassandra since 1.2.15. You should definitely consider using a more recent (2.x) version.