Cassandra: Data Type for Partition Key - Decimal or UUID

I want to describe the problem I am working on first:
I am currently trying to find a strategy for migrating data from an existing PostgreSQL database into a Cassandra cluster. The primary key in PostgreSQL is a decimal value with 25 digits. When I migrate the data, I would like to keep the value of the current primary key in one way or another and use it to uniquely identify the data in Cassandra. This key should be used as the partition key in Cassandra (no other columns are involved in the table I am talking about). After doing some research, I found that using UUIDs is considered good practice in Cassandra. So now I have two possible solutions:
I can create a transformation rule that converts my current decimal primary keys from the PostgreSQL database into UUIDs for Cassandra. Every time someone requests some of the old data, I would have to reapply the transformation rule to the key and use the resulting UUID to look up the data in Cassandra. The transformation would happen in an application server that manages all communication with Cassandra (so no client will talk to Cassandra directly). New data added to Cassandra would of course be stored with a UUID.
The other solution, which I have already implemented in Java, is to use a decimal value as the partition key in Cassandra. Since multiple application servers may talk to Cassandra concurrently, my current approach is to generate a UUID in my application and transform it into a decimal value. Using this approach, I could simply reuse all the existing primary keys from PostgreSQL.
I cannot simply create new keys for the existing data, since other applications have stored their own references to the old primary key values and will therefore try to request data with those keys.
Now here is my question: Both approaches seem to work and end up with unique keys to identify my data. The distribution of data across all nodes should also be fine. But I wonder whether there is any benefit in using a UUID over a decimal value as the partition key, or vice versa. I don't know exactly what Cassandra does to determine the hash value of the partition key and therefore cannot tell whether either data type is to be preferred. I am using the Murmur3Partitioner for Cassandra, if that is relevant.
Does anyone have any experience with this issue?
Thanks in advance for answers.

There are two benefits of UUIDs that I know of.
First, they can be generated independently with little chance of collisions. This is very useful in distributed systems, since you often have multiple clients wanting to insert data with unique keys. In an RDBMS we had the luxury of auto-incrementing fields to give uniqueness, since that could easily be done atomically, but in a distributed database we don't have efficient global atomic locks to do that.
The second advantage is that UUIDs are fairly efficient in terms of storage: they are 128-bit values and require only 16 bytes.
As long as your old decimal values are unique, you should be able to use them as partition keys.
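If you did want the transformation rule from the question, here is a minimal sketch of a reversible mapping. It assumes the legacy keys are non-negative whole numbers of at most 25 digits (the question does not state this), which fit comfortably into the 128 bits of a UUID. The function names are hypothetical.

```python
import uuid
from decimal import Decimal

# Assumption: legacy keys are non-negative integers with at most 25 digits.
MAX_LEGACY_KEY = 10**25 - 1


def legacy_key_to_uuid(key: Decimal) -> uuid.UUID:
    """Store the decimal key as the integer value of a 128-bit UUID."""
    if key != key.to_integral_value() or not (0 <= key <= MAX_LEGACY_KEY):
        raise ValueError(f"not a non-negative integer of at most 25 digits: {key}")
    return uuid.UUID(int=int(key))


def uuid_to_legacy_key(value: uuid.UUID) -> Decimal:
    """Recover the original decimal key from a UUID built by legacy_key_to_uuid."""
    return Decimal(value.int)


if __name__ == "__main__":
    old_key = Decimal("1234567890123456789012345")
    as_uuid = legacy_key_to_uuid(old_key)
    print(as_uuid, uuid_to_legacy_key(as_uuid) == old_key)
```

Note that values produced this way are plain 128-bit numbers and do not carry the version bits of a randomly generated UUID, so treat this as an illustration of the idea rather than a recommendation.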

Related

Is it possible to insert or update without providing all primary key columns in a Cassandra database?

I have an application which uses Cassandra as its database. Each row of the table is filled in at three separate moments (by three inputs). The table has four primary key columns, and not all of them are available at every moment an insert or update happens.
The error is:
'Some partition key parts are missing' when trying to insert or update.
Please note that my application has to perform a lot of writes (nearly 300,000) to the database in a short interval of time, so I want to keep the maximum write throughput available in the database.
One approach that might solve the issue is 'first read from the db, then write into the db, and use dummy values for a primary key column if it is not available at the moment of inserting or updating'. But that would add roughly another 300,000 reads to the database, and it would slow down the database and my application as a whole.
So I am looking for another solution.
"four primary keys in that table and these primary keys not available at all the moment going to insert or update."
As you are finding out, that is not possible. For partition keys in particular, they are used (hashed) to determine which node in the cluster is primarily responsible for the data. As it is a fundamental part of the Cassandra write path, it must be complete at write-time and cannot be changed/updated later.
The same is true with clustering keys (keys which determine the on-disk sort order of all data within a partition). Omitting one or more will yield this message:
Some clustering keys are missing
Unfortunately, there isn't a good way around this. A row's keys must all be known before writing. It's worth mentioning that keys in Cassandra are unique, so any attempt to update them will result in a new row.
Perhaps writing the data to a streaming topic or message broker (like Pulsar or Kafka) beforehand would be a better option? Then, once all keys are known, the data (message) can be consumed from the topic and written to Cassandra.
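As a rough illustration of that idea, the sketch below buffers the partial inputs in memory and only releases a row once every key column is present. The key column names and the correlation id that ties the three inputs together are assumptions, and a real consumer would read from the broker and write to Cassandra where this code just returns the row.

```python
from typing import Dict, Optional

# Hypothetical names for the four primary key columns.
KEY_COLUMNS = ("k1", "k2", "k3", "k4")


class RowAssembler:
    """Collects partial inputs until all key columns of a row are known."""

    def __init__(self) -> None:
        self._pending: Dict[str, dict] = {}

    def add_fragment(self, correlation_id: str, fragment: dict) -> Optional[dict]:
        """Merge a fragment; return the full row once it is writable, else None."""
        row = self._pending.setdefault(correlation_id, {})
        row.update(fragment)
        if all(row.get(col) is not None for col in KEY_COLUMNS):
            # Every key is known: hand the row over and stop buffering it.
            return self._pending.pop(correlation_id)
        return None


if __name__ == "__main__":
    assembler = RowAssembler()
    print(assembler.add_fragment("evt-1", {"k1": "a", "k2": "b"}))
    print(assembler.add_fragment("evt-1", {"k3": "c", "value": 10}))
    print(assembler.add_fragment("evt-1", {"k4": "d"}))
```

This avoids both the dummy key values and the extra reads, at the cost of holding incomplete rows somewhere outside Cassandra until they are complete.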

Inject Custom Sharding in Cassandra or Couchbase

Can I inject a sharding algorithm into either Cassandra or Couchbase?
Or do they decide where each document goes?
For instance, I might want to pin data to shards by one of the data properties.
Couchbase hashes the key of the document to decide which shard (vBucket) the document should be associated with. The SDK also uses the same algorithm to find out which shard the document is located in when you want to retrieve the document by its key.
One of the problems of letting developers decide on the sharding algorithm is that sometimes they end up with an excessive number of documents in a single shard, and naturally, this shard becomes the bottleneck of the application.
One of the core concepts in Couchbase is that the documents are (almost) evenly distributed between all shards, so I am not familiar with any native support to insert your own algorithm there.
Cassandra decides where the data goes by the partition key. So if you use the property you want to "pin" on as the partition key, I think it will accomplish what you're asking for. However, you don't pick the replicas explicitly, and they can change as hosts are removed from and added to the cluster.

Partition key for Azure Cosmos DB collection

I am a bit new to Azure Cosmos DB and trying to understand the concepts.
I want help deciding on the best possible partition key for a DocumentDB collection. Please refer to the image below, which shows possible partitions using different partition keys.
As mentioned in the blog post here,
An ideal partition key is one that appears frequently as a filter in
your queries and has sufficient cardinality to ensure your solution is
scalable.
From the above line, I think that in my case UserId can be used as the partition key.
Can someone please suggest which key is the best possible candidate for the partition key?
In 10 things to know about DocumentDB Partitioned Collections and the official documentation, you can find lots of very good advice about choosing a partition key, so I'm not going to repeat it here.
The selection of partitioning keys depends on the data stored in the database and the frequent query filtering criteria.
It is often advised to partition on something like userid, which works well if your business logic issues many queries for a given userid and each query looks up no more than a few hundred entries. In such cases the data can be quickly extracted from a single partition without the overhead of having to collate data across partitions.
However, if you have millions of records for the user then partitioning on userid is perhaps the worst option as extracting large volumes of data from a single partition will soon exceed the overhead of collation. In such cases you want to distribute user data as evenly as possible over all partitions. You may need to find another column to be the partition key.
So, if the data volume is very large, I suggest that you do some simple tests based on your business logic and choose the partition key that performs best. After all, the partition key cannot be changed once it is set up.
Hope it helps you.
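One such simple test can be done offline on a sample of documents, before anything is loaded into Cosmos DB. The sketch below counts how many distinct values a candidate property has and how large a share of the sample lands on its most common value. The function name, the property names and the sample data are all made up for illustration.

```python
from collections import Counter
from typing import Iterable, Mapping


def partition_skew(documents: Iterable[Mapping], key: str) -> dict:
    """Summarise how a sample would spread if `key` were the partition key."""
    counts = Counter(doc[key] for doc in documents)
    total = sum(counts.values())
    return {
        "distinct_keys": len(counts),                    # cardinality of the candidate
        "largest_share": max(counts.values()) / total,   # fraction held by the hottest value
    }


if __name__ == "__main__":
    sample = [
        {"userId": "u1", "orderId": "o1"},
        {"userId": "u1", "orderId": "o2"},
        {"userId": "u1", "orderId": "o3"},
        {"userId": "u1", "orderId": "o4"},
        {"userId": "u2", "orderId": "o5"},
        {"userId": "u3", "orderId": "o6"},
    ]
    for candidate in ("userId", "orderId"):
        print(candidate, partition_skew(sample, candidate))
```

A candidate with few distinct values or a very large share on one value is likely to produce a hot partition. This only looks at data distribution, so it still needs to be weighed against which property your queries actually filter on.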
It depends, but here are a few things to consider:
The blog post you mentioned says:
Additionally, the storage size for documents belonging to the same partition key is limited to 10GB. An ideal partition key is one that appears frequently as a filter in your queries and has sufficient cardinality to ensure your solution is scalable.
Also, I really recommend checking this post and video: https://learn.microsoft.com/en-us/azure/cosmos-db/partition-data
The choice of the partition key is an important decision that you have to make at design time. You must pick a property name that has a wide range of values and has even access patterns.
So make sure to choose a partition key that has many values and meets those requirements.

Apache Cassandra Several Partition Keys or Single Computed Key?

I am fairly new to Apache Cassandra, and one thing I am having a hard time understanding is whether I should have a table with several partition keys or a single computed key (computed in an application layer).
In my specific case I have 16 partition keys k1...k16 that together make a single data element unique. With several partition keys I need to provide all of them in my select statement, and I am okay with this, but are there any pros/cons of doing this in terms of storage and/or performance?
The way I understand it, the storage might be larger, but the partition keys are 'human readable' and potentially queryable by other clients of this data. I assume that Cassandra computes some hash on my partition keys whether it's a single value or several.
My question is: are there storage/performance issues or any other considerations I should think about when choosing between several partition keys and a single application-computed partition key?
You are correct, Cassandra converts a multi-part partition key into a single hash. So I think any efficiency gains from computing the hash in your application would be minimal at best.
Also, just in case you don't know this, keep in mind that the primary key is divided into the partition key and the clustering keys.
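To make the "single hash" point concrete, here is a toy sketch. It is not Cassandra's actual serialization or its Murmur3 hash (BLAKE2 from the Python standard library is used as a stand-in), and `toy_token` is a made-up name. It only shows that any number of key components collapses into one 64-bit token.

```python
import hashlib
import struct


def toy_token(*components: str) -> int:
    """Reduce one or more key components to a single signed 64-bit token.

    Illustration only: Cassandra uses its own encoding and Murmur3, not BLAKE2.
    """
    buf = b""
    for component in components:
        raw = component.encode("utf-8")
        # Length-prefix each part so ("ab", "c") and ("a", "bc") encode differently.
        buf += struct.pack(">H", len(raw)) + raw
    digest = hashlib.blake2b(buf, digest_size=8).digest()
    return int.from_bytes(digest, "big", signed=True)


if __name__ == "__main__":
    multi_part = toy_token("k1-value", "k2-value", "k3-value")
    precomputed = toy_token("k1-value|k2-value|k3-value")
    # Both are just one 64-bit number as far as data placement is concerned.
    print(multi_part, precomputed)
```

Either way the database ends up with one token per partition, which is why precomputing a key in the application mostly changes readability and queryability rather than placement.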
Cheers
Ben

Cassandra: Controlling which node receives data

My understanding of Cassandra's recommended clustering approach is that it ensures each node in the cluster receives an equal share of the data by hashing a document's unique Id. My question is whether there is a way to change this and define a custom key for "intelligently" routing a document to a specific node in the cluster.
In my scenario, I have data which relates to a specific entity (think client-project-task-item). Across all my data, I will have enough items to require some horizontal scaling; however, each search will always relate to a given client-project-task, for which the data set is only of moderate size.
Is there a way to create this type of partitioning / routing (different names I've seen for the same thing) logic in Cassandra?
Thanks; Brent
The clustering approach in Cassandra is not just about an equal distribution of data. It also ensures that all read/write operations are distributed across the cluster, which makes these operations faster. In addition, you will most likely have a replication factor greater than 1 to ensure data redundancy, so that a node failure does not result in data loss.
Back to your question and to your own answer. If you use the same partition key for the data, this guarantees that Cassandra's partitioning will store the primary replica of the data on the same node. Even more, it will store the rows in the same partition (a "wide row" in the older terminology).
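The effect can be pictured with a toy token ring. This is a simplified sketch, not Cassandra's implementation: the hash is a standard-library stand-in for Murmur3, the `ToyRing` class and node names are invented, and replication is ignored. It only demonstrates that rows sharing a partition key always resolve to the same primary node.

```python
import bisect
import hashlib
from typing import List, Sequence, Tuple


class ToyRing:
    """Simplified token ring: each node owns the ranges ending at its tokens."""

    def __init__(self, nodes: Sequence[str], tokens_per_node: int = 8) -> None:
        ring: List[Tuple[int, str]] = sorted(
            (self._token(f"{node}#{i}"), node)
            for node in nodes
            for i in range(tokens_per_node)
        )
        self._ring = ring
        self._tokens = [token for token, _ in ring]

    @staticmethod
    def _token(value: str) -> int:
        # Stand-in hash for illustration; Cassandra's default partitioner is Murmur3.
        digest = hashlib.blake2b(value.encode("utf-8"), digest_size=8).digest()
        return int.from_bytes(digest, "big")

    def primary_node(self, partition_key: str) -> str:
        """Return the owner of the first ring token >= the key's token (wrapping)."""
        index = bisect.bisect_left(self._tokens, self._token(partition_key))
        return self._ring[index % len(self._ring)][1]


if __name__ == "__main__":
    ring = ToyRing(["node-a", "node-b", "node-c"])
    # Hypothetical partition key built from client, project and task.
    key = "client42/project7/task3"
    print(ring.primary_node(key), ring.primary_node(key))
```

So rather than routing to a node by hand, you choose the partition key (here a client-project-task combination) and let the hash decide placement. Which node that is can change when nodes join or leave the cluster.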
I think - http://www.datastax.com/documentation/cql/3.0/share/glossary/gloss_partition_key.html - is the answer I'm looking for
The first column declared in the PRIMARY KEY definition, or in the case of a compound key, multiple columns can declare those columns that form the primary key.
