What is the time complexity (Big O) of Cassandra operations?

Assume there is only one node with R rows. What is the theoretical time complexity of basic Cassandra operations?
More specifically, I want to know:
key = item. I assume this is O(log(R)), is that right?
key > item, i.e. a slice. Will C* fetch all R rows to check whether the condition is met, resulting in O(R)? What about ordered rows?
key > 10 AND key < 12. Will C* first select everything that matches key > 10 and then filter with key < 12? Or will C* combine them into a single condition for the query?

You don't clarify whether you mean reads or writes, although it seems you are talking about read operations. The read path in Cassandra is highly optimized, with different read caches, bloom filters, and different compaction strategies (STCS, LCS, TWCS) governing how the data is structured on disk. Data is written to disk in one or more SSTables, and the presence of tombstones degrades read performance, sometimes significantly.
The Cassandra architecture is designed to provide linear scalability as data volumes grow. The premise of having just a single node would be the major limiting factor in read latency as the number of rows R becomes large.
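For illustration only (the table and column names below are hypothetical, not from the question): with the default Murmur3 partitioner, an exact key lookup targets a single partition, while range predicates on a partition key have to be expressed through token(), which is scanned as one contiguous token range rather than two separate passes. A rough sketch in CQL:

-- exact partition key lookup: locates a single partition
SELECT * FROM ks.t WHERE key = 42;

-- "slice" on the partition key: only supported via token(), i.e. token order, not key order
SELECT * FROM ks.t WHERE token(key) > token(42);

-- both bounds are applied as one contiguous token range scan, not "select > 10 then filter < 12"
SELECT * FROM ks.t WHERE token(key) > token(10) AND token(key) < token(12);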

DynamoDB and Cassandra partitioning strategy

In the Dynamo paper, the authors introduced 3 different partitioning strategies.
It seems DynamoDB has evolved from strategy 1 to strategy 3. I have a few questions related to strategy 3:
Refer to:
Since partition ranges are fixed, they can be stored in separate files, meaning a partition can be relocated as a unit by simply transferring the file (avoiding random accesses needed to locate specific items). This simplifies the process of bootstrapping and recovery.
How is this managed at a low level? One node can have a few partitions assigned to it. Is each partition handled separately inside the storage engine? For example, does each partition have a separate set of (memtable + SSTables), and do they compact at their own pace? This seems to introduce complexity into the system and make it hard to debug if the compaction processes go wild.
It seems the partitioning granularity is fixed beforehand. Is there any way to partition further after the initial stage? For example, if a-c is one partition, and later on prefix b is hot and becomes a noisy neighbor to prefix a and c, is there a way to isolate b to another node? How do we handle this situation in DynamoDB?
Does Cassandra use strategy 1 or strategy 3? From what I can tell from the num_tokens and initial_token settings in cassandra.yaml, I believe it's strategy 1, am I wrong?
Trying to answer each question in turn:
One node can have a few partitions assigned to it.
Each node will have 1 or more token ranges assigned during bootstrapping - depending on the partitioner this is a numeric range: -2^63 to +2^63-1 for the Murmur3 partitioner, or 0 to 2^127-1 for the random partitioner.
Each token in that range may or may not have a partition hashed to it, so while you are thinking of it as the node owning partitions, strictly speaking it owns token ranges.
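As a hedged illustration (keyspace, table, and column names are hypothetical): you can see the token a given partition key hashes to, and therefore which token range - and so which node - it falls into, with the token() function:

-- show the partitioner token that each partition key hashes to
SELECT key, token(key) FROM ks.tab LIMIT 5;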
Is each partition handled separately inside the storage engine?
This question doesn't really follow - an SSTable can contain 1 or more partitions, and a partition can be contained in 1 or more SSTables, i.e. a partition can span SSTables.
For example, does each partition have a separate set of (memtable + SSTables), and do they compact at their own pace?
No - there is a memtable per database table, and these are flushed to create the SSTables; how the resulting SSTables are compacted is determined by the table's compaction strategy setting, with quite different behaviours and advantages / disadvantages to each, depending on the usage scenario - one size does not fit all. Again, each SSTable can contain multiple partitions, and a partition can appear in more than one SSTable.
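For reference, a minimal sketch of where that setting lives (keyspace and table names are hypothetical) - the compaction strategy is a per-table option that applies to all of that table's SSTables:

ALTER TABLE ks.tab
  WITH compaction = {'class': 'LeveledCompactionStrategy'};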
This seems to introduce complexity into the system and make it hard to debug if the compaction processes go wild.
Compaction itself is not a trivial topic, but since the initial premise is not correct, this complexity is not in fact introduced.
It seems the partitioning granularity is fixed beforehand. Is there any way to partition further after the initial stage?
Writing specifically about Cassandra - every time you add or remove a node the token ranges that belong to each node can and will alter. So it is not entirely 'static', but it is not easy to change or manipulate either.
For example, if a-c is one partition, later on prefix b is hot and becomes a noisy neighbor to prefix a and c, is there a way to isolate b to another node?
Again - specific to Cassandra, in theory yes: you calculate the hash value of the partition key and use initial_token values on a node to give it a very narrow range. In practice no - this is a data model design issue, since the data has been partitioned in a way that creates a hot spot.
Does Cassandra use strategy 1 or strategy 3? From what I can tell from the num_tokens and initial_token settings in cassandra.yaml, I believe it's strategy 1, am I wrong?
Using num_tokens, which creates vnodes, in effect divides the consistent hash ring up more times: 10 nodes with num_tokens = 16 means the overall token range is divided into 160 slices, with each node owning 16 of them. Nodes will also hold replicas of other nodes' ranges, of course, based on the replication factor and rack assignments. If you only had RF=1, then each node would only store data for the range(s) it is assigned.
initial_token is the setting that controls the node's token(s) when it is bootstrapped - you can calculate and set it manually, or let Cassandra calculate it for you. Changing that setting after bootstrap has no effect.
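A hedged way to observe the result of these settings on a running node: each node exposes the tokens it ended up with, one per vnode, in the system tables.

-- lists one token per vnode owned by the node you are connected to
SELECT host_id, tokens FROM system.local;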

Does having a lot of unique, small partitions in a table affect performance or create extra load in Cassandra?

I have a table with 4 million unique partition keys
select count(*) from "KS".table;

 count
---------
 4355748

(1 rows)
I have read that the cardinality of the partition key should not be too high and also not too low, which means don't make the partition key too unique. Is that correct?
The table does not have any clustering key. Will changing data partitioning help with the load?
It really depends on the use case... If you don't have natural clustering by partition, then there may be little sense in introducing it. Also, what are the read patterns? Do you need to read multiple rows in one go, or not?
The number of partitions affects the size of the bloom filter, key cache, etc., so as you increase the number of partitions, the bloom filter grows and the key cache gets fewer hits (until you increase its size).
As far as I know, Cassandra uses consistent hashing to map a partition key to a token (and hence to a node), so cardinality by itself should not matter for data distribution.
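If the read pattern does fetch related rows together, here is a minimal sketch of what 'natural clustering' could look like (table and column names are hypothetical, not from the question):

-- group related rows under one partition so they can be read with a single query
CREATE TABLE ks.events_by_device (
    device_id  text,
    event_time timestamp,
    payload    text,
    PRIMARY KEY ((device_id), event_time)
);

-- one partition read instead of many single-row lookups
SELECT * FROM ks.events_by_device WHERE device_id = 'device-42';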

How to effectively use spark to read cassandra data that has partition hotspots?

From everything I can tell, spark uses at most one task per cassandra partition when reading from cassandra. Unfortunately, I have a few partitions in cassandra that are enormously unbalanced (bad initial table design). I need to read that data into a new table, which will be better designed to handle the hotspots, but any attempt to do so with normal spark avenues won't work effectively; I'm left with a few tasks (10+) running forever, working on those few enormous partition keys.
To give you an idea of scale, this is working on a table that is about 1.5TB in size, spread over 5 servers with a replication factor of 3; ~ 500GB per node.
Other ideas are welcome, though just dumping to CSV is probably not a realistic option.
Materialized view creation is also a no-go, so far; it takes entirely too long, and at least on 3.0.8, there is little to no monitoring during the creation.
This is a difficult problem which can't really be solved automatically, but if you know how your data is distributed within your really huge partitions, I can give you an option.
Instead of using a single RDD/DataFrame to represent your table, split the read into multiple calls and union them.
Basically you want to do this
Given our biggest partition is set up like this
Key1 -> C1, C2, C3, ..., C5000000
And we know in general C is distributed like
Min C = 0
Max C = 5000000
Average C = 250000
We can guess that we can cut up these large partitions pretty nicely by doing range pushdowns every 100K C values.
import com.datastax.spark.connector._ // provides sc.cassandraTable

val interval = 100000
val maxValue = 5000000 // matches "Max C" above

// One RDD per slice of the clustering column c, unioned into a single RDD.
sc.union(
  (0 until maxValue by interval).map { lowerBound =>
    sc.cassandraTable("ks", "tab")
      .where(s"c >= $lowerBound AND c < ${lowerBound + interval}")
  }
)
We end up with more, smaller Spark partitions (and probably lots of empty ones), but this should let us successfully cut those huge Cassandra partitions down to size. This can only be done if you can figure out the distribution of values within the partition, though.
Note: the same thing is possible by unioning DataFrames.
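A hedged sketch of that DataFrame variant, assuming the spark-cassandra-connector data source and the same hypothetical ks.tab table with clustering column c:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().getOrCreate()
val interval = 100000
val maxValue = 5000000

// One DataFrame per slice of the clustering column, unioned together;
// the connector can push these range predicates down to Cassandra.
val slices = (0 until maxValue by interval).map { lowerBound =>
  spark.read
    .format("org.apache.spark.sql.cassandra")
    .options(Map("keyspace" -> "ks", "table" -> "tab"))
    .load()
    .filter(col("c") >= lowerBound && col("c") < lowerBound + interval)
}
val wholeTable = slices.reduce(_ union _)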

One bigger partition, or a few smaller but more distributed partitions, for range queries in Cassandra?

We have a table that stores our data partitioned by files. One file is 200MB to 8GB of JSON - but there's a lot of overhead, obviously. Compacting the raw data will lower this drastically. I ingested about 35 GB of JSON data and only one node got slightly more than 800 MB of data. This is possibly due to "write hotspots" -- but we only write once and then only read. We do not update data. Currently, we have one partition per file.
By using secondary indexes, we search for partitions in the database that contain a specific geolocation (= first query) and then take the result of this query to range query a time range of the found partitions (= second query). This might even be the whole file if needed but in 95% of the queries only chunks of a partition are queried.
We have a replication factor of 2 on a 6-node cluster. Data is fairly evenly distributed; every node owns 31.9% to 35.7% (effective) of the data according to nodetool status <keyspace>.
Good read performance is key for us.
My questions:
How big is too big for a partition in terms of volume or row size? Is there a rule of thumb for this?
For Range Query performance: Is it better to split up our "big" partitions to have more smaller partitions? We built our schema with "big" partitions because we thought that when we do range queries on a partition, it would be good to have it all on one node so data can be fetched easily. Note that the data is also available on one replica due to RF 2.
C* supports very large partitions, but that doesn't mean it is a good idea to go to that level. The right limit depends on the specific use case, but a good ballpark is somewhere between 10k and 50k rows per partition. Of course, everything is a compromise: if you have "huge" (in terms of bytes) rows, then heavily limit the number of rows in each partition; if you have "small" (in terms of bytes) rows, you can relax that limit a bit. This matters because a partition lives on a single replica set (two nodes with your RF=2), so all queries for a specific partition will hit only those nodes.
Range queries should ideally go to one partition only. A range query means a sequential scan of your partition on the node receiving the query; however, you then limit yourself to the throughput of that node. If you split your range queries across more nodes - that is, you change the way you partition your data by adding something like a bucket to the partition key - you fetch data from different nodes with parallel queries, directly increasing the total throughput. Of course you lose the ordering of records across different buckets, so if the order within your partition matters, that may not be feasible.
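A minimal sketch of that bucketing idea (table, columns, and the bucket scheme are hypothetical, not from the question): the bucket becomes part of the partition key, so each bucket lands on its own replica set and the client issues the time-range query once per bucket, in parallel:

-- time series split into buckets so range reads can be parallelised across nodes
CREATE TABLE ks.readings_by_file (
    file_id  text,
    bucket   int,          -- e.g. hash(ts) % 8, or a coarse time bucket
    ts       timestamp,
    value    text,
    PRIMARY KEY ((file_id, bucket), ts)
);

-- issued once per bucket (in parallel) and merged client-side
SELECT * FROM ks.readings_by_file
 WHERE file_id = 'file-42' AND bucket = 3
   AND ts >= '2024-01-01' AND ts < '2024-02-01';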

Number of SSTables for a given column family

Folks,
We are evaluating Cassandra for one of our production applications. We have a few basic questions which we would like to understand before going forward.
WRITE:
Cassandra uses a consistent hashing mechanism to distribute keys evenly across nodes, so a given key will live on a particular Cassandra node.
We further understand that internal SSTable structures are created to store this data within the node.
READ:
While performing a read, the client sends a request to any node in the Cassandra cluster, and based on consistent hashing Cassandra determines which node the key is located on.
The following things are not clear:
1) How many SSTables are created for a given keyspace/column family on a node (is it some fixed number, or only one)?
2) The Cassandra documentation describes a bloom filter (an alternative to standard hashing) which is used to determine whether a given key is present in an SSTable or not. (If there are 1000 SSTables, will there be 1000 bloom filters to check to determine whether the key is present?)
1) The number of SSTables depends on the compaction strategy and the load. To get an idea, check out log-structured merge trees for a basic understanding, then look at the different compaction strategies (size-tiered, leveled, date-tiered).
2) Yes, there is one bloom filter per SSTable, giving a probabilistic answer to whether a partition exists in that SSTable. The size of the bloom filter depends on the number of partitions and the target false-positive percentage. Bloom filters are kept off-heap and are generally small, so they are less of a concern nowadays than in earlier versions.
Checking out the Dynamo and Bigtable papers may help in understanding the principles behind the clustering and storage. There are a lot of free resources on the read/write path, and too much to fully cover in a Stack Overflow answer, so I would recommend going through some material at DataStax Academy or some presentations on YouTube.
