Inject Custom Sharding in Cassandra or Couchbase - cassandra

Can I inject a sharding algorithm into either Cassandra or Couchbase?
Or do they decide where each document goes?
For instance, I might want to pin data to shards by one of the data properties.

Couchbase hashes the document key to decide which shard (vBucket) the document belongs to. The SDK uses the same algorithm to find the shard when you retrieve the document by its key.
One of the problems of letting developers decide on the sharding algorithm is that sometimes they end up with an excessive number of documents in a single shard, and naturally, this shard becomes the bottleneck of the application.
One of the core concepts in Couchbase is that the documents are (almost) evenly distributed between all shards, so I am not familiar with any native support to insert your own algorithm there.
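As a rough illustration of the key hashing described above, here is a minimal Python sketch. It assumes 1024 vBuckets and uses a plain CRC32 modulo. The real SDKs use their own CRC32-based mapping, so the bucket numbers below will not match a real cluster. The name vbucket_for_key is made up for this example.

```python
import zlib
from collections import Counter

NUM_VBUCKETS = 1024  # assumption for this sketch


def vbucket_for_key(key: str, num_vbuckets: int = NUM_VBUCKETS) -> int:
    """Map a document key to a vBucket id (simplified: CRC32 modulo)."""
    return zlib.crc32(key.encode("utf-8")) % num_vbuckets


# The same key always maps to the same vBucket, which is how a client can
# locate a document without asking the cluster where it lives.
assert vbucket_for_key("user::42") == vbucket_for_key("user::42")

# Count how many of 100,000 sample keys fall into each vBucket.
counts = Counter(vbucket_for_key(f"user::{i}") for i in range(100_000))
print(len(counts), min(counts.values()), max(counts.values()))
```

Because the bucket is derived only from the key, the application has no hook to say "put this document on shard 7"; the most it can do is choose the key.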

Cassandra decides where the data goes by the partition key. So if you use the property you want to "pin" on as the partition key, it should accomplish what you're asking for. However, you don't pick the replicas explicitly, and they can change as hosts are added to or removed from the cluster.
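To make that concrete, here is a small, simplified model of a token ring in Python. It is a sketch under assumptions, not Cassandra's implementation: MD5 stands in for the real partitioner, each node gets a single token, and replication is ignored. Ring, token and the node names are hypothetical.

```python
import bisect
import hashlib


def token(partition_key: str) -> int:
    """Stand-in for the partitioner: hash the key to a 64-bit token."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class Ring:
    def __init__(self, nodes):
        # One token per node for simplicity.
        self._ring = sorted((token(n), n) for n in nodes)
        self._tokens = [t for t, _ in self._ring]

    def owner(self, partition_key: str) -> str:
        # First node whose token is >= the key's token, wrapping around.
        i = bisect.bisect_left(self._tokens, token(partition_key))
        return self._ring[i % len(self._ring)][1]


ring = Ring(["node-a", "node-b", "node-c"])
before = ring.owner("customer-17")
# Every row with the same partition key value has the same owner.
assert ring.owner("customer-17") == before

# After adding a host, the owner may or may not be the same node.
bigger = Ring(["node-a", "node-b", "node-c", "node-d"])
after = bigger.owner("customer-17")
print(before, after)
```

The point of the sketch is that you control which rows live together (by giving them the same partition key), but not which physical node that ends up being.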

Related

Partition key for Azure Cosmos DB collection

I am a bit new to Azure Cosmos DB and trying to understand the concepts.
I want help deciding on the best possible partition key for a DocumentDB collection. Please refer to the image below, which shows the possible partitions using different partition keys.
As mentioned in the blog post here,
An ideal partition key is one that appears frequently as a filter in
your queries and has sufficient cardinality to ensure your solution is
scalable.
From the above line, I think that in my case UserId can be used as the partition key.
Can someone please suggest which key is the best candidate for the partition key?
In "10 things to know about DocumentDB Partitioned Collections" and the official Microsoft documentation you can find lots of very good advice about choosing a partition key, so I'm not going to repeat it here.
The selection of partitioning keys depends on the data stored in the database and the frequent query filtering criteria.
It is often advised to partition on something like userid. That works well if your business logic runs many queries for a given userid and each one looks up no more than a few hundred entries. In such cases the data can be quickly extracted from a single partition without the overhead of having to collate data across partitions.
However, if you have millions of records per user, then partitioning on userid is perhaps the worst option, because the cost of extracting large volumes of data from a single partition will soon exceed the overhead of collation. In such cases you want to distribute user data as evenly as possible over all partitions. You may need to find another column to be the partition key.
So, if the data volume is very large, I suggest that you run some simple tests based on your business logic and choose the partition key that performs best. After all, the partition key cannot be changed once it is set.
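One simple test of that kind is to take a sample of your documents and check how evenly each candidate key would spread them. The following is a minimal sketch with made-up sample data; partition_skew and the orderId field are hypothetical, and it only measures document counts, not request load or storage size.

```python
from collections import Counter


def partition_skew(documents, key_field):
    """Return (distinct key count, share of documents in the largest partition)."""
    counts = Counter(doc[key_field] for doc in documents)
    largest = max(counts.values())
    return len(counts), largest / len(documents)


# Hypothetical sample: one very active user and a hundred small ones.
docs = [{"userId": "u0", "orderId": f"o{i}"} for i in range(900)]
docs += [{"userId": f"u{i}", "orderId": f"o{900 + i}"} for i in range(1, 101)]

print(partition_skew(docs, "userId"))   # (101, 0.9)
print(partition_skew(docs, "orderId"))  # (1000, 0.001)
```

Here userId puts 90% of the documents into one logical partition, while orderId spreads them evenly but would force cross-partition queries whenever you filter by user. Which trade-off is acceptable depends on your query patterns.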
Hope it helps you.
It depends, but here are a few things to consider:
The blog post you mentioned says:
Additionally, the storage size for documents belonging to the same partition key is limited to 10GB. An ideal partition key is one that appears frequently as a filter in your queries and has sufficient cardinality to ensure your solution is scalable.
Also, I really recommend checking this post and video: https://learn.microsoft.com/en-us/azure/cosmos-db/partition-data
The choice of the partition key is an important decision that you have to make at design time. You must pick a property name that has a wide range of values and has even access patterns.
So make sure to choose a partition key that has many distinct values and meets those requirements.

Big data solution for frequent queries

I need a big data storage solution for batch inserts of denormalized data which happen infrequently and queries on the inserted data which happen frequently.
I've gone through Cassandra and feel that it's not that good for batch inserts, but an OK solution for querying. Also, it would be good if there was a mechanism to segregate data based on a data attribute.
As you mentioned Cassandra I will talk about it:
Can you insert in an unbatched way, or is batching imposed by the system? If you can insert unbatched, Cassandra will probably be able to handle it easily.
Batched inserts should also be manageable for Cassandra nodes, but they won't distribute the load properly among all the nodes (NOTE: I'm talking about load balancing, not about data balance, which depends only on your partition key setup). If you are not very familiar with Cassandra, you could tell us your data structure and your query types and we could suggest how to use Cassandra's data model to fit them.
For the filtering part of the question, Cassandra has clustering keys and secondary indexes. A secondary index works roughly like adding another column to the key, so that you can query on both.

Cassandra: Data Type for Partition Key - Decimal or UUID

I want to describe the problem I am working on first:
Currently I try to find a strategy that would allow me to migrate data from an existing PostgreSQL database into a Cassandra cluster. The primary key in the PostgreSQL is a decimal value with 25 digits. When I migrate the data, it would be nice if I could keep the value of the current primary key in one way or another and use it to uniquely identify the data in Cassandra. This key should be used as the partition key in Cassandra (no other columns are involved in the table I am talking about). After doing some research, I found out that a good practise is to use UUIDs in Cassandra. So now I have two possible solutions to solve my problem:
I can either create a transformation rule that would convert my current decimal primary keys from the PostgreSQL database into UUIDs for Cassandra. Every time someone requests access to some of the old data, I would have to reapply the transformation rule to the key and use the UUID to search for the data in Cassandra. The transformation would happen in an application server that manages all communication with Cassandra (so no client will talk to Cassandra directly). New data added to Cassandra would of course be stored with a UUID.
The other solution, which I already have implemented in Java at the moment, is to use a decimal value as the partition key in Cassandra. Since it is possible that multiple application servers will talk to Cassandra concurrently, my current approach is to generate a UUID in my application and transform it into a decimal value. Using this approach, I could simply reuse all the existing primary keys from PostgreSQL.
I cannot simply create new keys for the existing data, since other applications have stored their own references to the old primary key values and will therefore try to request data with those keys.
Now here is my question: Both approaches seem to work and end up with unique keys to identify my data. The distribution of data across all nodes should also be fine. But I wonder if there is any benefit in using a UUID over a decimal value as the partition key, or vice versa. I don't know exactly what Cassandra does to determine the hash value of the partition key and therefore cannot determine whether any data type is to be preferred. I am using the Murmur3Partitioner for Cassandra, if that is relevant.
Does anyone have any experience with this issue?
Thanks in advance for answers.
There are two benefits of UUIDs that I know of.
First, they can be generated independently with little chance of collisions. This is very useful in distributed systems since you often have multiple clients wanting to insert data with unique keys. In RDBMS we had the luxury of auto-incrementing fields to give uniqueness since that could easily be done atomically, but in a distributed database we don't have efficient global atomic locks to do that.
The second advantage is that UUIDs are fairly efficient in terms of storage: each one is a fixed 16 bytes (128 bits).
As long as your old decimal values are unique, you should be able to use them as partition keys.
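If you do go the transformation route from the question, note that a 25-digit decimal is smaller than 2^128, so it fits into the 128 bits of a UUID without loss. Here is a minimal Python sketch of such a reversible rule; the function names are hypothetical, and the resulting values are not random RFC 4122 UUIDs, they only reuse the 128-bit container.

```python
import uuid

MAX_LEGACY_KEY = 10**25 - 1  # largest 25-digit decimal value


def decimal_key_to_uuid(key: int) -> uuid.UUID:
    """Pack a legacy decimal key into a UUID by reusing it as the 128-bit integer."""
    if not 0 <= key <= MAX_LEGACY_KEY:
        raise ValueError("key must be a non-negative decimal of at most 25 digits")
    return uuid.UUID(int=key)


def uuid_to_decimal_key(value: uuid.UUID) -> int:
    """Inverse of decimal_key_to_uuid."""
    return value.int


legacy = 1234567890123456789012345
as_uuid = decimal_key_to_uuid(legacy)
assert uuid_to_decimal_key(as_uuid) == legacy  # round trip is lossless
assert len(as_uuid.bytes) == 16

# New rows can get independently generated random keys.
new_key = uuid.uuid4()
print(as_uuid, new_key)
```

Either way the partitioner hashes the serialized key, so the even spread across nodes comes from the hash rather than from the column type.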

Cassandra: Controlling which node receives data

My understanding of Cassandra's recommended clustering approach is to ensure that each node in the cluster receives an equal distribution of data, by hashing a document's unique Id. My question is if there is a way to change this and define a custom key for "intelligently" routing a document to a specific node in the cluster?
In my scenario, I have data which relates to a specific entity (think client-project-task-item). Across all my data, I will have enough items to require some horizontal scaling; however, each search will always relate to a given client-project-task, for which the data set is only a moderate size.
Is there a way to create this type of partitioning / routing (different names I've seen for the same thing) logic in Cassandra?
Thanks; Brent
The clustering approach in Cassandra is not just for an equal distribution of data. It also ensures that all read/write operations are distributed across the cluster to make these operations faster. In addition to this, you will most likely have a replication factor greater than 1 to ensure data redundancy, so that a node failure does not result in data loss.
Back to your question and to your own answer. If you use the same partition key for the data, this guarantees that Cassandra partitioning will store the primary replica of the data on the same node, and what is more, it will store the rows in the same partition (a "wide row" in the old terminology).
I think - http://www.datastax.com/documentation/cql/3.0/share/glossary/gloss_partition_key.html - is the answer I'm looking for
The first column declared in the PRIMARY KEY definition, or in the case of a compound key, multiple columns can declare those columns that form the primary key.
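For the client-project-task-item scenario in the question, a compound partition key of (client, project, task) with the item id as a clustering column would keep each task's items together. The sketch below models that layout with plain Python dictionaries. It is an illustration of the idea, not driver code, and the insert/select helpers and sample values are hypothetical.

```python
from collections import defaultdict

# Partition key (client, project, task) -> rows keyed by clustering column (item id).
partitions = defaultdict(dict)


def insert(client, project, task, item_id, payload):
    partitions[(client, project, task)][item_id] = payload


def select(client, project, task):
    """Return all rows of one partition, ordered by the clustering column."""
    rows = partitions.get((client, project, task), {})
    return [(item_id, rows[item_id]) for item_id in sorted(rows)]


insert("acme", "website", "design", 2, "wireframes")
insert("acme", "website", "design", 1, "brief")
insert("acme", "website", "build", 1, "repo setup")

print(select("acme", "website", "design"))  # [(1, 'brief'), (2, 'wireframes')]
```

A search for one client-project-task then touches exactly one partition, while different tasks still spread across the cluster.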

Choosing a NoSQL database

I need a NoSQL database that will run on Windows Azure and works well for the following parameters. Right now Azure Table Storage, HBase and Cassandra seem to be the most promising options.
1 billion entities
up to 100 reads per second, though caching will mostly make it much less
around 10 - 50 writes per second
Strong consistency would be a plus, so perhaps HBase would be better than Cassandra in that regard.
Querying will often be done on a secondary in-memory database with various indexes in addition to ElasticSearch or Windows Azure Search for fulltext search and perhaps some filtering.
Azure Table Storage looks like it could be nice, but from what I can tell, the big difference between Azure Table Storage and HBase is that HBase supports updating and reading values for a single property instead of the whole entity at once. I guess there must be some disadvantages to HBase however, but I'm not sure what they would be in this case.
I also think crate.io looks like it could be interesting, but I wonder if there might be unforeseen problems.
Anyone have any other ideas of the advantages and disadvantages of the different databases in this case, and if any of them are really unsuited for some reason?
I currently work with Cassandra and I might help with a few pros and cons.
Requirements
Cassandra can easily handle those 3 requirements. It was designed to have fast reads and writes. In fact, Cassandra is blazing fast with writes, mostly because you can write without doing a read.
Also, Cassandra keeps some of its data in memory, so you could even avoid the secondary database.
Consistency
In Cassandra you choose the consistency in each query you make, therefore you can have consistent data if you want to. Normally you use:
ONE - Only one replica has to acknowledge the read or write. This means fast reads/writes, but low consistency (another machine can still deliver older information until the replicas have converged).
QUORUM - A majority of the replicas of that data (floor(RF/2) + 1, where RF is the replication factor) must acknowledge the read or write. This means reads and writes are not as fast, but you get strong consistency IF you use it for BOTH reads and writes. That's because if a majority of the replicas have your data after an insert/update/delete, then any read from a majority of the replicas includes at least one replica with the most recent data, and that is the version returned.
These two options are the ones usually recommended because they avoid single points of failure. If all replicas had to acknowledge and one node was down or busy, you wouldn't be able to query.
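The overlap argument comes down to simple arithmetic: a read is guaranteed to hit at least one up-to-date replica whenever the number of replicas written plus the number read exceeds the replication factor. A minimal Python sketch follows; the function names are made up for illustration.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that form a majority."""
    return replication_factor // 2 + 1


def reads_see_latest_write(rf: int, write_replicas: int, read_replicas: int) -> bool:
    """True if every read set must overlap every write set (W + R > RF)."""
    return write_replicas + read_replicas > rf


rf = 3
q = quorum(rf)  # 2 replicas out of 3

assert reads_see_latest_write(rf, q, q)       # QUORUM writes + QUORUM reads overlap
assert not reads_see_latest_write(rf, 1, 1)   # ONE + ONE can miss the latest write
assert reads_see_latest_write(rf, rf, 1)      # ALL writes + ONE reads also overlap
```

With a replication factor of 3, QUORUM still succeeds when one replica is down, which is why it is a common middle ground.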
Pros
Cassandra is the solution for performance, linear scalability, and avoiding single points of failure (if some machines are down, the others take over the work). And it does most of its management work automatically. You don't need to manage the data distribution, replication, etc.
Cons
The downsides of Cassandra are in the modeling and queries.
With a relational database you model around the entities and the relationships between them. Normally you don't really care about what queries will be made and you work to normalize it.
With Cassandra the strategy is different. You model the tables to serve the queries. And that happens because you can't join and you can't filter the data any way you want (only by its primary key).
So if you have a database for a company with grocery stores and you want to make a query that returns all products of a certain store (Ex.: New York City), and another query to return all products of a certain department (Ex.: Computers), you would have two tables "ProductsByStore" and "ProductsByDepartment" with the same data, but organized differently to serve the query.
Materialized views can help with this by avoiding the need to apply each change to multiple tables, but the example shows how differently things work with Cassandra.
Denormalization is also common in Cassandra for the same reason: Performance.
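The grocery store example can be modelled in a few lines of Python to show what "one table per query" means in practice. This is only an in-memory sketch with made-up products; add_product is a hypothetical helper standing in for the application code that writes to both tables.

```python
from collections import defaultdict

# Two "tables" holding the same rows, each keyed for one query.
products_by_store = defaultdict(list)
products_by_department = defaultdict(list)


def add_product(name, store, department):
    """Write the same row to both tables (denormalization)."""
    row = {"name": name, "store": store, "department": department}
    products_by_store[store].append(row)
    products_by_department[department].append(row)


add_product("Laptop", "New York City", "Computers")
add_product("Apples", "New York City", "Produce")
add_product("Monitor", "Boston", "Computers")

# Each query is a single lookup by key; no join and no scan.
nyc = [r["name"] for r in products_by_store["New York City"]]
computers = [r["name"] for r in products_by_department["Computers"]]
print(nyc)        # ['Laptop', 'Apples']
print(computers)  # ['Laptop', 'Monitor']
```

The cost is that every write has to update both structures, which is the bookkeeping that materialized views are meant to take off your hands.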
