Relationship between node and partition key in Cassandra

What is the relationship between a node and a partition key in Cassandra? Data is stored on a node according to the hash value of its partition key. Does that mean there is a "one to one" relationship between a node and a partition key, i.e. one node holds only one hashed partition key value, or can a node hold multiple hashed partition key values?
As I'm new to Cassandra, I got confused on this basic point.

Partition keys determine the locality of the data. In a Cassandra cluster with RF=1, there is only a single copy of every item, and all items with the same partition key are stored on the same node. Depending on your use case, this can be good or bad.
Back to your question: it is NOT true that "one node contains only one value of hashed value of partition key". It is the other way around: all items with the same partition key are stored on one node (potentially along with other partition keys).

Each node in Cassandra is responsible for a range of partition key hash values (consistent hashing).
By default Cassandra uses the Murmur3 partitioner (Murmur3Partitioner).
So each node in Cassandra holds multiple partition keys. For a given partition key there is only one copy on one node; the other copies are stored on other nodes according to the replication factor.
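To make the range ownership concrete, here is a minimal sketch of a token ring with RF=1. It is an illustration only: the node names, the three token values, and the `Ring` class are made up, and MD5 is used as a stand-in hash because Murmur3 is not in the Python standard library.

```python
import bisect
import hashlib


def token(partition_key):
    """Map a partition key to a signed 64-bit token.

    Assumption: MD5 stands in for Cassandra's Murmur3 hash.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)


class Ring:
    """Hypothetical ring: each node owns the token range ending at its token."""

    def __init__(self, node_tokens):
        self._entries = sorted((t, name) for name, t in node_tokens.items())
        self._tokens = [t for t, _ in self._entries]

    def owner(self, partition_key):
        # First node whose token is >= the key's token; wrap around past the end.
        i = bisect.bisect_left(self._tokens, token(partition_key))
        return self._entries[i % len(self._entries)][1]


ring = Ring({"node-a": -(2**62), "node-b": 0, "node-c": 2**62})

placement = {}
for key in (f"user-{i}" for i in range(1000)):
    placement.setdefault(ring.owner(key), []).append(key)

for node, keys in sorted(placement.items()):
    print(node, "holds", len(keys), "partition keys")
```

Each of the 1000 keys lands on exactly one node, and each node ends up holding many different keys, which is the many-keys-per-node relationship described above.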

Related

Client data isolation: can Cassandra store data in different partitions in separate file sets?

Suppose I have a Cassandra table with an integer partition key.
Question: is it possible to arrange for Cassandra to store the table data and indexes for that table in separate sets of files by partition value? Alternative approaches such as per-partition keyspaces or duplicated tables (Account1 for partition key 1, Account2 for partition key 2) are deemed to undercut Cassandra performance.
The desired outcome is to reduce the possibility that selecting sensitive client data for partition 1 also returns data from other partitions. If the data is kept separate (and searched separately), this risk is reduced, though obviously not eliminated. Essentially it shifts some of the responsibility for using the right partition key at the right time from the application code onto Cassandra.
It's not possible in Cassandra itself unless you separate the data into tables/keyspaces, but as you mentioned, that will lead to bad performance.
DataStax Enterprise (DSE) has a feature called Row Level Access Control that lets you set permissions based on the value of the partition key (or part of the partition key).
If you need to stick to plain Cassandra, then you need to do it on the application level.

Using cassandra in single node, should I still worry about choosing a "good" partition key?

We are using Cassandra on a single node. I understand that in a cluster, a smart partition key allows data to be distributed across the cluster and avoids all the keys being stored on the same host. However, in our case there's just one host, and I could use a constant (dummy) partition key, but I wanted to check whether I would miss out on something if I do that. For example, Cassandra has a limit of at most 2 billion cells per partition. Does Cassandra honor that limit on a single host too? Can I have a table with more than 2 billion cells on a single-node Cassandra?
Can I have a table with more than 2 billion cells on a single node cassandra?
Ans: Yes.
Instead of using a constant (dummy) partition key, I would recommend choosing a good partition key. That keeps you open for expansion, since in the future you may want to run Cassandra as a cluster. It doesn't matter whether you run Cassandra on a single node or in a cluster, because the limit applies per partition, not per node. So a single node can hold more than 2 billion cells, as long as they are spread across multiple partitions.
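The arithmetic behind this can be sketched as follows. The 2 billion figure is the one quoted in the question; the layouts, bucket names, and the `check_layout` helper are hypothetical and only illustrate that the limit is checked per partition.

```python
MAX_CELLS_PER_PARTITION = 2_000_000_000  # figure quoted in the question


def check_layout(cells_by_partition):
    """Return the node-wide cell total and the partitions over the limit."""
    total = sum(cells_by_partition.values())
    oversized = [
        pk for pk, cells in cells_by_partition.items()
        if cells > MAX_CELLS_PER_PARTITION
    ]
    return total, oversized


# Two hypothetical ways to hold the same 6 billion cells on one node.
constant_key = {"dummy": 6_000_000_000}
spread_keys = {f"bucket-{i}": 1_000_000 for i in range(6000)}

print(check_layout(constant_key))  # one partition, over the limit
print(check_layout(spread_keys))   # same total, no partition over the limit
```

With the constant key, the single partition exceeds the limit; with the data spread over many partition keys, the node holds the same total and no partition comes close.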

Determining write nodes in Cassandra

I've just started reading about Cassandra and I can't quite understand how Cassandra decides which nodes it should write the data to.
What I understand is that Cassandra uses part of the primary key, specifically the partition key, together with the partitioner to get a token by hashing the partition key, and from that token the node/vnode it is bound to.
Now let's say I have 2 nodes in my cluster, each with 256 vnodes, and I'm not using any clustering keys, just a simple PK and a bunch of simple columns. Hashing the partition key would clearly determine where the data should go. By this logic, there would be only 512 unique records available for storage.
Would be funny if true. So am I wrong at the partitioner part?
Consider the base case: just a single node, with a single token. Do you think it can store only one record? Of course not.
The hash determines which node the row will go to, true. But the primary key determines where in the node the row will be stored. A token marks the end of a whole range of hash values, so many distinct primary keys may map to the same token range, but they will all be stored separately by the node.
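A small simulation of the 2 nodes x 256 vnodes case shows why the count of tokens does not cap the count of rows. This is a sketch under assumptions: vnode tokens are drawn at random with a fixed seed, MD5 stands in for Murmur3, and the per-node storage is just a dictionary keyed by primary key.

```python
import bisect
import hashlib
import random


def token(partition_key):
    # Assumption: MD5 as a stand-in for Cassandra's Murmur3 hash.
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)


rng = random.Random(42)

# 2 nodes x 256 vnodes = 512 tokens, each closing one token range.
vnodes = sorted(
    (rng.randrange(-2**63, 2**63), node)
    for node in ("node-1", "node-2")
    for _ in range(256)
)
tokens = [t for t, _ in vnodes]

# Each node stores rows keyed by primary key, not by token.
storage = {"node-1": {}, "node-2": {}}
for i in range(10_000):
    pk = f"record-{i}"
    idx = bisect.bisect_left(tokens, token(pk)) % len(vnodes)
    storage[vnodes[idx][1]][pk] = {"value": i}

total_rows = sum(len(rows) for rows in storage.values())
print(len(vnodes), "token ranges,", total_rows, "rows stored")
```

All 10,000 rows are kept, even though there are only 512 token ranges: the token picks the node, and the primary key identifies the row inside it.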

Cassandra: Sort by query

I have a somewhat special request.
Setup: I use a Redis DB to store geo data and use GEORADIUS to get it back, sorted by distance. With these keys I look up the data in Cassandra. But Cassandra's result is sorted by key or by something else.
What I want is to get the information back in the same order I requested it.
The partition key is built from an id (which I get back from Redis) and a status.
Can I tell Cassandra to sort by an array of ids?
Partition keys are designed to be randomly distributed across different nodes. You can use ByteOrderedPartitioner to do ordered queries, but BOP is considered an anti-pattern in Cassandra and I would strongly recommend against it. You can read more about it here: Cassandra ByteOrderedPartitioner.
You can add more columns to the primary key, which determine how data is stored on disk. These are known as clustering keys. You can do ORDER BY queries on clustering keys. This is a good document on clustering keys: https://docs.datastax.com/en/cql/3.1/cql/ddl/ddl_compound_keys_c.html
If you can share more schema details, I can suggest what to use as clustering key.
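As a rough model of what clustering keys do, the sketch below keeps the rows of each partition sorted by clustering key, so reads within one partition come back ordered. The `Table` class, the `(id, status)` partition key, and the integer clustering values are hypothetical; it models the idea in memory and is not how Cassandra is implemented. Unlike Cassandra, it does not overwrite a row when the same primary key is inserted twice.

```python
import bisect
from collections import defaultdict


class Table:
    """Toy model: rows inside a partition stay sorted by clustering key."""

    def __init__(self):
        self._partitions = defaultdict(list)

    def insert(self, partition_key, clustering_key, value):
        # insort keeps the partition's rows ordered by clustering key.
        bisect.insort(self._partitions[partition_key], (clustering_key, value))

    def select(self, partition_key, descending=False):
        rows = self._partitions.get(partition_key, [])
        return list(reversed(rows)) if descending else list(rows)


table = Table()
pk = ("42", "open")  # hypothetical partition key built from (id, status)
table.insert(pk, 3, "c")
table.insert(pk, 1, "a")
table.insert(pk, 2, "b")

print(table.select(pk))                   # ascending by clustering key
print(table.select(pk, descending=True))  # reversed order
```

Note that the ordering only holds inside a single partition. Results that span several partition keys, such as a list of ids from Redis, still have no guaranteed order relative to each other.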

Apache Cassandra Several Partition Keys or Single Computed Key?

I am fairly new to Apache Cassandra, and one thing I am having a hard time understanding is whether I should have a table with several partition keys or a single computed key (computed in an application layer).
In my specific case I have 16 partition keys k1...k16 that make a single data element unique. With several partition keys I need to provide all of them in my select statement, and I am okay with this, but are there any pros/cons of doing this in terms of storage and/or performance?
The way I understand it, the storage might be larger, but the partition keys are 'human readable' and potentially queryable by other clients of this data. I assume that Cassandra computes some hash on my partition keys whether it's a single value or several.
My question: are there storage/performance issues or any other considerations I should think about when choosing between several partition keys and a single application-computed partition key?
You are correct, Cassandra converts a multi-part partition key into a single hash. So I think any efficiency gains from computing the hash in your application would be minimal at best.
Also, just in case you don't know this, keep in mind that the primary key is divided into the partition key and the clustering keys.
Cheers
Ben
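The point about a multi-part partition key collapsing into one hash can be sketched like this. The encoding (a 2-byte length prefix per component) and the MD5 stand-in are assumptions for illustration; Cassandra has its own serialization for composite keys and hashes with Murmur3 by default.

```python
import hashlib


def composite_token(*components):
    """Hash all partition key components together into one signed 64-bit token.

    Assumption: length-prefixed concatenation plus MD5, chosen for illustration.
    """
    buf = bytearray()
    for component in components:
        raw = component.encode("utf-8")
        # The length prefix keeps ("a", "bc") distinct from ("ab", "c").
        buf += len(raw).to_bytes(2, "big") + raw
    digest = hashlib.md5(bytes(buf)).digest()
    return int.from_bytes(digest[:8], "big", signed=True)


# Hypothetical 16-part key k1...k16: still a single token.
sixteen_parts = [f"k{i}" for i in range(1, 17)]
print(composite_token(*sixteen_parts))
```

Whether the key has one component or sixteen, the result is a single token, so precomputing a combined key in the application does roughly the same work one step earlier.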
