Determining write nodes in Cassandra - cassandra

I've just started reading about Cassandra and I can't quite understand how Cassandra decides which nodes it should write the data to.
What I understand is that Cassandra takes part of the primary key, specifically the partition key, and has the partitioner hash it into a token, which in turn identifies the node/vnode that the token is bound to.
Now let's say I have 2 nodes in my cluster, each with 256 vnodes, and I'm not using any clustering keys, just a simple PK and a bunch of simple columns. Hashing the partition key would clearly determine where the data should go. By this logic, there would be only 512 unique records available for storage.
Would be funny if true. So am I wrong at the partitioner part?

Consider the base case: just a single node, with a single token. Do you think it can store only one record? Of course not.
The hash determines which node the row will go to, true. But each node or vnode is responsible for a whole range of token values, not for a single hash, and the primary key determines where in the node the row will be stored. Many distinct primary keys may hash into the same range (or, rarely, to the same token), but they will all be stored separately by the node.
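To make the range idea concrete, here is a minimal sketch of a token ring with 2 nodes and 256 vnodes each. It is only an illustration under assumptions: `token_for`, `Ring`, and the node names are hypothetical, vnode tokens are derived from made-up labels rather than assigned the way Cassandra assigns them, and MD5 stands in for the Murmur3 hash so the example needs only the standard library.

```python
import bisect
import hashlib


def token_for(partition_key: str) -> int:
    """Stand-in partitioner: MD5 cut down to a signed 64-bit token.

    Assumption for the demo only; Cassandra's default partitioner uses Murmur3.
    """
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)


class Ring:
    def __init__(self, nodes, vnodes_per_node):
        # Every vnode contributes one token; each token closes a range of hashes.
        self.entries = sorted(
            (token_for(f"{node}-vnode-{i}"), node)
            for node in nodes
            for i in range(vnodes_per_node)
        )
        self.tokens = [token for token, _ in self.entries]

    def owner(self, partition_key: str) -> str:
        # The first vnode token >= the key's token owns the key; wrap at the end.
        idx = bisect.bisect_left(self.tokens, token_for(partition_key))
        return self.entries[idx % len(self.entries)][1]


ring = Ring(["node1", "node2"], vnodes_per_node=256)

# Store far more than 512 keys: each one lands in some vnode's range.
placement = {}
for i in range(10_000):
    key = f"user-{i}"
    placement.setdefault(ring.owner(key), []).append(key)

for node, keys in sorted(placement.items()):
    print(node, len(keys))
```

The ring has 512 tokens, yet all 10,000 keys find an owner, because a token marks the boundary of a range rather than a slot for one record.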

Related

Relationship between node and partition key in cassandra

What is the relationship between a node and a partition key in Cassandra? Data is stored on a node according to the hash value of the partition key. Does that mean there is a one-to-one relationship between a node and a partition key, i.e. one node holds only one hashed partition key value, or can a node hold multiple hashed partition key values?
As I'm new to Cassandra, I got confused on this basic point.
Partition keys determine the locality of the data. In a Cassandra cluster with RF=1, there will be only a single copy of every item, and all the items with the same partition key will be stored on the same node. Depending on your use case, this can be good or bad.
Back to your question: it is NOT true that "one node contains only one value of hashed value of partition key", but rather the other way around: all the items with the same partition key are stored on one node (potentially along with other partition keys).
Each node in Cassandra is responsible for a range of hash values of the partition key (consistent hashing).
By default, Cassandra uses the Murmur3Partitioner.
So each node in Cassandra holds multiple partition keys. A given partition is stored only once on any one node; the other copies live on other nodes, according to the replication factor. See "Consistent Hashing in Cassandra".
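As a rough illustration of how the replication factor spreads copies, the sketch below walks a small ring clockwise from the key's token and collects distinct nodes until it has RF of them. This is a simplified model that ignores racks and datacenters; `token_for`, `build_ring`, `replicas_for`, and the node names are hypothetical, and MD5 again stands in for Murmur3.

```python
import bisect
import hashlib


def token_for(key: str) -> int:
    # Stand-in hash (assumption): MD5 reduced to a signed 64-bit token.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big", signed=True)


def build_ring(nodes, vnodes_per_node=8):
    # Sorted list of (token, node) pairs, one pair per vnode.
    return sorted(
        (token_for(f"{node}#{i}"), node)
        for node in nodes
        for i in range(vnodes_per_node)
    )


def replicas_for(ring, partition_key, replication_factor):
    tokens = [token for token, _ in ring]
    start = bisect.bisect_left(tokens, token_for(partition_key))
    owners = []
    # Walk clockwise from the owning vnode, skipping nodes already chosen.
    for step in range(len(ring)):
        node = ring[(start + step) % len(ring)][1]
        if node not in owners:
            owners.append(node)
        if len(owners) == replication_factor:
            break
    return owners


ring = build_ring(["node1", "node2", "node3", "node4"])
print(replicas_for(ring, "customer-42", 1))  # a single copy with RF=1
print(replicas_for(ring, "customer-42", 3))  # three distinct nodes with RF=3
```

Each node appears at most once in the result, which mirrors the point above: one copy of a partition per node, with further copies on other nodes.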

Cassandra: Sort by query

I have a somewhat special request.
Setup: I use a Redis DB to store geo data and use GEORADIUS to get it back, sorted by distance. With these keys I look up the data in Cassandra. But the result from Cassandra is sorted by the key or by something else.
What I want is to get the information back in the same order I requested it.
The partition key is built from the id (which I get back from Redis) and a status.
Could I tell Cassandra to sort by the id array?
Partition keys are designed to be randomly distributed across different nodes. You can use ByteOrderedPartitioner to do ordered queries, but BOP is considered an anti-pattern in Cassandra and I highly recommend against it. You can read more about it here: Cassandra ByteOrderedPartitioner.
You can add more columns to the primary key, which will determine how the data is stored on disk. These are known as clustering keys. You can do ORDER BY queries on clustering keys. This is a good document on clustering keys: https://docs.datastax.com/en/cql/3.1/cql/ddl/ddl_compound_keys_c.html.
If you can share more schema details, I can suggest what to use as clustering key.
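For intuition about what clustering keys do, here is a toy in-memory model, not Cassandra's storage engine: rows that share a partition key are kept sorted by clustering key, so an ordered read within one partition is cheap. The `Table` class and the sample keys are hypothetical.

```python
import bisect
from collections import defaultdict


class Table:
    """Toy model: partition key -> rows kept sorted by clustering key."""

    def __init__(self):
        self.partitions = defaultdict(list)

    def insert(self, partition_key, clustering_key, value):
        rows = self.partitions[partition_key]
        keys = [ck for ck, _ in rows]
        idx = bisect.bisect_left(keys, clustering_key)
        if idx < len(rows) and rows[idx][0] == clustering_key:
            rows[idx] = (clustering_key, value)  # same primary key: overwrite
        else:
            rows.insert(idx, (clustering_key, value))  # keep sorted order

    def select(self, partition_key, descending=False):
        rows = self.partitions.get(partition_key, [])
        return list(reversed(rows)) if descending else list(rows)


table = Table()
table.insert("sensor-1", 30, "c")
table.insert("sensor-1", 10, "a")
table.insert("sensor-1", 20, "b")

print(table.select("sensor-1"))
# [(10, 'a'), (20, 'b'), (30, 'c')]
print(table.select("sensor-1", descending=True))
# [(30, 'c'), (20, 'b'), (10, 'a')]
```

Note that the ordering applies within a single partition only; it says nothing about the order of rows that live in different partitions.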

Apache Cassandra Several Partition Keys or Single Computed Key?

I am fairly new to Apache Cassandra, and one thing I am having a hard time understanding is whether I should have a table with several partition keys or a single computed key (computed in an application layer).
In my specific case I have 16 partition keys k1...k16 that make a single data element unique. With several partition keys I need to provide all of them in my select statement, and I am okay with this, but are there any pros/cons of doing this in terms of storage and/or performance?
The way I understand it, the storage might be larger, but the partition keys are 'human readable' and potentially queryable by other clients of this data. I assume that Cassandra computes some hash on my partition keys whether it's a single value or several.
My question is: are there storage/performance issues or any other considerations I should think about when choosing between several partition keys and a single application-computed partition key?
You are correct, Cassandra converts a multi-part partition key into a single hash. So I think any efficiency gains from computing the hash in your application would be minimal at best.
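A small sketch of the idea that a multi-part partition key collapses into one token: the components are serialized into a single byte string, which is then hashed once. The length-prefixed layout, the function names, and the MD5 stand-in are assumptions made for the illustration, not Cassandra's exact wire format or hash.

```python
import hashlib


def serialize_composite(components):
    """Join key components into one byte string (illustrative layout only)."""
    out = bytearray()
    for component in components:
        raw = component.encode("utf-8")
        # A 2-byte length prefix keeps component boundaries unambiguous,
        # so ("ab", "c") and ("a", "bc") serialize differently.
        out += len(raw).to_bytes(2, "big")
        out += raw
    return bytes(out)


def token_for(components) -> int:
    # Stand-in hash (assumption): MD5 reduced to a signed 64-bit token.
    digest = hashlib.md5(serialize_composite(components)).digest()
    return int.from_bytes(digest[:8], "big", signed=True)


# Sixteen hypothetical key parts k1..k16 still produce exactly one token.
wide_key = [f"k{i}-value" for i in range(1, 17)]
print(token_for(wide_key))
```

Whether the key has one component or sixteen, the ring only ever sees the single resulting token, which is why pre-hashing in the application buys little.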
Also, just in case you don't know this, keep in mind that the primary key is divided into the partition key and the clustering keys.
Cheers
Ben

Cassandra: Data Type for Partition Key - Decimal or UUID

I want to describe the problem I am working on first:
Currently I am trying to find a strategy that would allow me to migrate data from an existing PostgreSQL database into a Cassandra cluster. The primary key in PostgreSQL is a decimal value with 25 digits. When I migrate the data, it would be nice if I could keep the value of the current primary key in one way or another and use it to uniquely identify the data in Cassandra. This key should be used as the partition key in Cassandra (no other columns are involved in the table I am talking about). After doing some research, I found out that a good practice is to use UUIDs in Cassandra. So now I have two possible solutions to my problem:
I can either create a transformation rule that converts my current decimal primary keys from the PostgreSQL database into UUIDs for Cassandra. Every time someone requests some of the old data, I would have to reapply the transformation rule to the key and use the UUID to search for the data in Cassandra. The transformation would happen in an application server that manages all communication with Cassandra (so no client will talk to Cassandra directly). New data added to Cassandra would of course be stored with a UUID.
The other solution, which I have already implemented in Java, is to use a decimal value as the partition key in Cassandra. Since multiple application servers may talk to Cassandra concurrently, my current approach is to generate a UUID in my application and transform it into a decimal value. Using this approach, I could simply reuse all the existing primary keys from PostgreSQL.
I cannot simply create new keys for the existing data, since other applications have stored their own references to the old primary key values and will therefore try to request data with those keys.
Now here is my question: both approaches seem to work and end up with unique keys to identify my data. The distribution of data across all nodes should also be fine. But I wonder if there is any benefit in using a UUID over a decimal value as the partition key, or vice versa. I don't know exactly what Cassandra does to determine the hash value of the partition key and therefore cannot determine whether either data type is to be preferred. I am using the Murmur3Partitioner for Cassandra, if that is relevant.
Does anyone have any experience with this issue?
Thanks in advance for answers.
There are two benefits of UUIDs that I know of.
First, they can be generated independently with little chance of collisions. This is very useful in distributed systems, since you often have multiple clients wanting to insert data with unique keys. In an RDBMS we had the luxury of auto-incrementing fields to give uniqueness, since that could easily be done atomically, but in a distributed database we don't have efficient global atomic locks to do that.
The second advantage is that UUIDs are fairly efficient in terms of storage: each one is a fixed 128 bits, so it requires only 16 bytes.
As long as your old decimal values are unique, you should be able to use them as partition keys.
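For what it's worth, the two transformations described in the question are straightforward, because a 25-digit decimal fits comfortably into the 128 bits of a UUID. The sketch below shows one possible mapping using Python's standard `uuid` module; the function names and the sample key are hypothetical, and a UUID built this way from an arbitrary integer does not carry meaningful version or variant bits.

```python
import uuid
from decimal import Decimal


def decimal_to_uuid(value: Decimal) -> uuid.UUID:
    """Map a non-negative whole decimal below 2**128 onto a UUID."""
    number = int(value)
    if number != value or not 0 <= number < 2**128:
        raise ValueError("expected a whole number in the range [0, 2**128)")
    return uuid.UUID(int=number)


def uuid_to_decimal(value: uuid.UUID) -> Decimal:
    """Inverse mapping: the UUID's 128-bit integer as an exact decimal."""
    return Decimal(value.int)


legacy_key = Decimal("1234567890123456789012345")  # made-up 25-digit key
as_uuid = decimal_to_uuid(legacy_key)

print(as_uuid)
print(len(as_uuid.bytes))  # 16
print(uuid_to_decimal(as_uuid) == legacy_key)  # True

# New keys can go the other way: generate a UUID, then store it as a decimal.
fresh = uuid.uuid4()
print(decimal_to_uuid(uuid_to_decimal(fresh)) == fresh)  # True
```

Since the mapping is reversible in both directions, either representation identifies the same rows; the choice mostly affects readability and what other applications already reference.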

Cassandra: Controlling which node receives data

My understanding of Cassandra's recommended clustering approach is that it ensures each node in the cluster receives an equal share of the data by hashing a document's unique Id. My question is whether there is a way to change this and define a custom key for "intelligently" routing a document to a specific node in the cluster.
In my scenario, I have data which relates to a specific entity (think client-project-task-item). Across all my data, I will have enough items to require some horizontal scaling; however, each search will always relate to a given client-project-task, for which the data set is only of moderate size.
Is there a way to create this type of partitioning / routing (different names I've seen for the same thing) logic in Cassandra?
Thanks; Brent
The clustering approach in Cassandra is not just about an equal distribution of data. It also ensures that all read/write operations are distributed across the cluster, which makes these operations faster. In addition, you will most likely have a replication factor greater than 1 to ensure data redundancy, so that a node failure does not result in data loss.
Back to your question and to your own answer. If you use the same partition key for the data, Cassandra's partitioning guarantees that the primary replica of that data is stored on the same node; even more, the data is stored in the same partition (a "wide row" in the old naming).
I think - http://www.datastax.com/documentation/cql/3.0/share/glossary/gloss_partition_key.html - is the answer I'm looking for
The first column declared in the PRIMARY KEY definition, or in the case of a compound key, multiple columns can declare those columns that form the primary key.
