In my understanding, the distributed KVS typically looks like:
There is a leader which manages metadata
There are multiple followers which manage data
A client interacts with the leader
When a client asks the leader to WRITE some data, the leader decides which node should own the data using a hash algorithm (e.g. consistent hashing) and passes the data to that node
The leader also copies the data to some other nodes so that it is not lost in case of an outage
This is my understanding. My point is that in this architecture, the data is not copied to all the follower nodes.
However, etcd replicates all the data to every node using Raft. In my understanding it should not be called a distributed KVS but just master-replica replication.
Is there any definition of a distributed KVS? Should a system be called a distributed KVS simply because it consists of multiple servers? Please let me know if I'm missing something.
I believe that your definition of a distributed KVS (Key-Value Store) is really specific. Here is the Wikipedia definition of a distributed data store:
A distributed data store is a computer network where information is stored on more than one node, often in a replicated fashion. It is usually specifically used to refer to either a distributed database where users store information on a number of nodes, or a computer network in which users store information on a number of peer network nodes.
etcd fits this definition. I'd also argue that etcd is more than replication, as there is a consensus algorithm (Raft, as you mentioned) at its heart. It gives some guarantees that (I believe) plain replication doesn't:
Fault tolerance: the cluster tolerates the failure of up to (n-1)/2 nodes (rounded down)
No committed value is lost because of a node failure (as long as we stay within the fault-tolerance boundary)
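To make the (n-1)/2 bound concrete, here is a small Python sketch of the majority arithmetic behind it. The function names are mine and are not part of etcd or any Raft library; this only illustrates the counting, not the protocol itself.

```python
def quorum_size(n: int) -> int:
    """Smallest majority of an n-node cluster."""
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """Nodes that may fail while a majority is still reachable."""
    return (n - 1) // 2


for n in (1, 3, 4, 5):
    # The surviving nodes must still be able to form a quorum.
    assert n - tolerated_failures(n) >= quorum_size(n)
    print(f"nodes={n} quorum={quorum_size(n)} tolerated={tolerated_failures(n)}")
```

Note that going from 3 to 4 nodes does not raise the number of tolerated failures, which is why odd cluster sizes are usually recommended.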
In MongoDB we can choose any of the models below:
1. Simple replication (without sharding, where one node works as the master and the others as slaves), or
2. Sharding (where data is distributed across different shards based on the partition key), or
3. Both 1 and 2
My question: can't we have Cassandra with replication only, without partitioning, just like model 1 in MongoDB?
From "Cassandra vs MongoDB in respect of Secondary Index?":
In case of Cassandra, the data is distributed into multiple nodes based on the partition key.
From the above, it looks like it is mandatory to distribute the data based on some partition key when we have more than one node?
In Cassandra, the replication factor defines how many copies of the data you have. The partition key is responsible for distributing the data between nodes. But the resulting distribution depends on the number of nodes that you have. For example, if you have a 3-node cluster and a replication factor of 3, then all nodes will get all the data anyway...
Basically your intuition is right: the data is always distributed based on the partition key. The partition key (sometimes called the row key) is the first part of the primary key, so every table has one anyway. Case 1 of your Mongo example is not doable in Cassandra, mainly because Cassandra has no concept of masters and slaves.
If you have a 2-node cluster and a replication factor of 2, the data will be held on both nodes, as Alex Ott already pointed out. When you query (read or write), your client decides which node to connect to and performs the operation there. To my knowledge, the default is round-robin load balancing between the two nodes, so both receive roughly the same load.
If you have 3 nodes and a replication factor of 2, it becomes a little trickier. The nice part, though, is that the client can determine which nodes hold your data, so you don't lose any performance by connecting to a "wrong" node.
One more thing about partitions: you can configure some of this, but the setting is per server and not per table. I've never used it, and personally I wouldn't recommend doing so. Just stick to the default mechanism of Cassandra.
And one word about the secondary index question: use materialized views.
How can I use the ByteOrderedPartitioner (BOP) to force specific key values to be partitioned according to a custom requirement? I want to force Cassandra to partition and replicate data according to custom requirements without introducing a custom partitioner. How far can I control this behavior, and how?
Overall: I want my data starting with particular ID to be at a predefined node because I know data will be accessed from that node heavily. I would also like the data to be replicated to nearby nodes.
I want my data starting with particular ID to be at a predefined node because I know data will be accessed from that node heavily.
It looks like you are talking about the data locality problem, which is really important in big-data-style computations (Spark, Hadoop, etc.). But the general approach there isn't to pin data to a specific node; it is to move the whole computation to wherever the data lives.
Pinning data to a specific node may cause problems like:
what should you do if your node goes down?
how evenly will the data be distributed across the cluster? Will there be any hotspots/bottlenecks because of node over(under)-utilization?
how can you scale your cluster in the future?
Moving the computation to the data raises none of these questions, but the approach you are going to choose does.
Found the answer here...
http://www.mail-archive.com/user%40cassandra.apache.org/msg14997.html
By changing the "initial_token" setting in the cassandra.yaml file, we can divide the nodes into key ranges. The partitioner then chooses the node that stores the first replica of the data, and the SimpleStrategy strategy class places the remaining replicas on the following nodes in the ring. So by arranging the nodes the way you want, you can exploit the replication strategy.
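A rough Python sketch of that idea, assuming an order-preserving partitioner (so the key bytes themselves act as the token) and SimpleStrategy-like placement on the next nodes in the ring. The node names and token values are made up for illustration, and this is a simplified model, not Cassandra code.

```python
import bisect

# Hypothetical ring: (initial_token, node), sorted by token.
# Each node owns the keys in the range (previous token, its own token].
RING = [(b"g", "node1"), (b"n", "node2"), (b"z", "node3")]


def replica_nodes(key: bytes, rf: int) -> list:
    """Return the node holding the first replica, followed by the next rf-1 nodes."""
    tokens = [t for t, _ in RING]
    # First node whose token is >= key; keys past the last token wrap around.
    first = bisect.bisect_left(tokens, key) % len(RING)
    return [RING[(first + i) % len(RING)][1] for i in range(min(rf, len(RING)))]


print(replica_nodes(b"alice", 2))  # ['node1', 'node2']
print(replica_nodes(b"harry", 2))  # ['node2', 'node3']
```

In this model, choosing the tokens decides which key prefixes land on which node first, and the order of the nodes in the ring decides where the extra replicas go.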
(I could not find a good source explaining this, so if it is available elsewhere, you could just point me to it)
Hazelcast replicates data across all nodes in a cluster. So, if data is changed on one of the nodes, does the node update its own copy and then propagate it to the other nodes?
I read somewhere that each piece of data is owned by a node. How does Hazelcast determine the owner? Is the owner determined per data structure or per key in the data structure?
Does Hazelcast follow the "eventually consistent" principle? (While the data is being propagated across the nodes, there could be a small window during which the data is inconsistent between the nodes)
How are conflicts handled? (Two nodes update the same key-value simultaneously)
Hazelcast does not replicate (with the exception of ReplicatedMap, obviously ;-)) but partitions data. That means exactly one node owns a given key. All updates to that key go to the owner, which then sends out notifications about the update.
The owner is determined by consistent hashing using the following formula:
partitionId = hash(serialize(key)) % partitionCount
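The formula can be tried out with a few lines of Python. This is only a sketch: pickle and CRC32 are stand-ins I chose for illustration (Hazelcast uses its own serialization and hash function), and 271 is assumed here as the partition count because it is Hazelcast's default.

```python
import pickle
import zlib

PARTITION_COUNT = 271  # assumed: Hazelcast's default partition count


def partition_id(key) -> int:
    """partitionId = hash(serialize(key)) % partitionCount"""
    data = pickle.dumps(key)        # stand-in for Hazelcast serialization
    digest = zlib.crc32(data)       # stand-in for Hazelcast's hash function
    return digest % PARTITION_COUNT


for key in ("user:1", "user:2", 42):
    print(key, "->", partition_id(key))
```

The same key always maps to the same partition, and each partition has a single owning node at any given time, which is why every update for a key ends up at one owner.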
Since there is only one owner per key, it is not eventually consistent but consistent as soon as the mutating operation returns: under normal operating conditions, all subsequent read operations will see the new value. When any kind of failure happens (network, host, ...), we choose availability over consistency, and it may happen that a backup that has not yet been updated is reactivated (especially if you use async backups).
Conflicts can happen after a split-brain, when the split clusters re-merge. For this case you have to configure a MergePolicy (or use the default one) to define how conflicting elements are merged together or which of the two wins.
I'm going to implement consistent hashing over a bunch of nodes. Each node has a limited capacity (let's say 1GB). I start with one node, and when it's getting full I'm going to add another node and use consistent hashing to redistribute the data, and then keep adding new nodes as needed. However, there is still a chance that a node gets full. I know some NoSQL databases such as Cassandra use consistent hashing to do something similar to what I'm doing. How can I keep nodes from overflowing when using consistent hashing?
Cassandra does not use consistent hashing in the way you described.
Each table has a partition key (you can think of it as the primary key, or the first part of it, in RDBMS terminology). This key is hashed using the Murmur3 algorithm. The whole hash space forms a continuous ring from the lowest possible hash to the highest. This ring is divided into chunks (vnodes, 256 per node by default in Cassandra 3.x and earlier), and these chunks are distributed evenly among the nodes. Each node hosts not only its own part of the ring but also replicated copies of other vnodes, according to the replication factor.
This way of doing things helps to solve a lot of problems:
the data load is balanced among all cluster nodes, so no single node is overloaded (data size, reads and writes are evenly distributed, no hot spots)
if you add a new node to a cluster, it takes over its own part of the ring and pulls the required vnodes automatically from other nodes. No need for manual resharding.
if a node fails, thanks to replication you won't lose any data because it is already stored on other nodes. In this case you can decommission the failed node so that the other nodes redistribute its part of the ring among themselves. No need for complex switchover scenarios for failed DB nodes.
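To see why many small chunks per node balance the load, here is a toy Python model of such a ring. It is a sketch under my own assumptions: MD5 stands in for Murmur3, tokens are derived from made-up node names, and replication is left out.

```python
import bisect
import hashlib
from collections import Counter


def token(value: str) -> int:
    # MD5 is only a stand-in here; Cassandra's default partitioner uses Murmur3.
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")


def build_ring(nodes, vnodes_per_node):
    """Give every node several pseudo-random tokens (vnodes) on the ring."""
    return sorted((token(f"{node}#{i}"), node)
                  for node in nodes for i in range(vnodes_per_node))


def load(nodes, vnodes_per_node, n_keys=10_000):
    """Count how many of n_keys sample keys each node ends up owning."""
    ring = build_ring(nodes, vnodes_per_node)
    tokens = [t for t, _ in ring]
    counts = Counter()
    for k in range(n_keys):
        # The owner is the first vnode clockwise from the key's token.
        i = bisect.bisect_left(tokens, token(f"key-{k}")) % len(ring)
        counts[ring[i][1]] += 1
    return counts


nodes = ["n1", "n2", "n3", "n4"]
print("1 token per node:   ", dict(load(nodes, 1)))
print("256 tokens per node:", dict(load(nodes, 256)))
```

With a single token per node the shares depend on where the few tokens happen to fall; with many vnodes per node the random differences average out and the per-node counts come out much closer to each other.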
Of course, you can always implement similar DB behaviour on top of any RDBMS in your application layer, but that is much harder and more error-prone than using an existing battle-tested solution.
I assume you know how keys get moved from one node to another when a node is added or removed. As for your question of how to get a uniform distribution:
You can have your own logic to make it happen. Keep monitoring all the nodes on the hash ring; if any node is getting hot (handling more keys), insert another node before it so that the load is split between the old and the new node. Similarly, if any node is underutilized, you can remove it so that its load shifts to the next node.
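A minimal Python sketch of the "insert a node before the hot one" step, assuming one ring position per node and a 32-bit ring. The Ring class, its method names and the node names are all hypothetical, and MD5 is just a convenient hash for the illustration.

```python
import bisect
import hashlib
from collections import Counter

RING_SIZE = 2 ** 32


def position(value: str) -> int:
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:4], "big")


class Ring:
    def __init__(self):
        self.points = []  # sorted list of (position, node)

    def add(self, node, pos=None):
        bisect.insort(self.points, (position(node) if pos is None else pos, node))

    def owner(self, key):
        positions = [p for p, _ in self.points]
        i = bisect.bisect_left(positions, position(key)) % len(self.points)
        return self.points[i][1]

    def load(self, keys):
        return Counter(self.owner(k) for k in keys)

    def split_hottest(self, keys, new_node):
        """Place new_node halfway through the range of the node with the most keys."""
        hot = self.load(keys).most_common(1)[0][0]
        i = next(idx for idx, (_, n) in enumerate(self.points) if n == hot)
        end = self.points[i][0]
        start = self.points[i - 1][0]  # predecessor; index -1 wraps around
        span = (end - start) % RING_SIZE or RING_SIZE
        self.add(new_node, (start + span // 2) % RING_SIZE)
        return hot


ring = Ring()
for name in ("a", "b", "c"):
    ring.add(name)
keys = [f"key-{i}" for i in range(5000)]
print("before:", dict(ring.load(keys)))
hot = ring.split_hottest(keys, "d")
print(f"after splitting {hot}:", dict(ring.load(keys)))
```

Only the keys in the first half of the hot node's range move to the new node; every other node keeps exactly the keys it had, which is the main selling point of consistent hashing.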
Hope this helps!
I'm currently delving into CouchDB, and I am puzzled by the distribution of Map-Reduce computations in views. I see a lot of resources mentioning that Map-Reduce is inherently distributed, because you can process one half of your data on server A, the other half on server B, and then reduce both results. One example would be slide 16 of this presentation:
http://www.slideshare.net/gabriele.lana/couchdb-vs-mongodb-2982288
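To spell out what I mean by "process one half on server A, the other half on server B, then reduce both results", here is the idea as a plain Python sketch. This is not CouchDB's API (CouchDB views are normally written as JavaScript functions); the documents, the tag-counting view and the function names are made up for illustration.

```python
from collections import Counter


def map_doc(doc):
    """Map step: emit (key, value) pairs for one document."""
    for tag in doc["tags"]:
        yield tag, 1


def reduce_pairs(pairs):
    """Reduce step: sum the values emitted for each key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return totals


def rereduce(partials):
    """Combine already-reduced partial results, e.g. one per server."""
    return sum(partials, Counter())


docs = [
    {"_id": "1", "tags": ["db", "nosql"]},
    {"_id": "2", "tags": ["db"]},
    {"_id": "3", "tags": ["nosql", "mapreduce"]},
    {"_id": "4", "tags": ["db", "mapreduce"]},
]

# Pretend the documents are split across two servers.
server_a, server_b = docs[:2], docs[2:]
partial_a = reduce_pairs(pair for d in server_a for pair in map_doc(d))
partial_b = reduce_pairs(pair for d in server_b for pair in map_doc(d))

combined = rereduce([partial_a, partial_b])
single = reduce_pairs(pair for d in docs for pair in map_doc(d))
print(dict(combined))
```

The combined result equals what a single server would compute over all documents, because summing is associative and commutative; that property is what makes the computation distributable in principle.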
This seems fairly logical, but:
CouchDB does not seem to provide an API for dispatching computations to several servers. The only distribution it appears to provide is replication of the entire data set to other servers (which would then, I assume, compute their own view data).
CouchDB uses a B-Tree to store view data based on keys that are generated in the Map step of the view algorithm, which precludes appropriate partitioning of documents based on what server they should be on.
So, does CouchDB distribute Map-Reduce computations at all? Or is the Map-Reduce property used merely to cache values in the B-Tree nodes?
You are looking for BigCouch; it enables a CouchDB cluster and uses distributed MapReduce.
CouchDB does NOT distribute views across nodes, since CouchDB is not a distributed application. You can only continuously replicate from one instance to another, but each instance still works alone.