Routing table creation at a node in a Pastry P2P network

This question is about the routing table creation at a node in a p2p network based on Pastry.
I'm trying to simulate this scheme of routing table creation in a single JVM. I can't seem to understand how these routing tables are created, starting from the point at which the first node joins.
I have N independent nodes, each with a 160-bit nodeId generated as a SHA-1 hash, and a function to determine the proximity between these nodes. Let's say the 1st node starts the ring and joins it. The protocol says that this node should have had its routing tables set up at this time. But I do not have any other nodes in the ring at this point, so how does it even begin to create its routing tables?
When the 2nd node wishes to join the ring, it sends a Join message (containing its nodeId) to the 1st node, which passes it along in hops to the closest available neighbor of this 2nd node already existing in the ring. These hops contribute to the creation of routing table entries for this new 2nd node. Again, in the absence of a sufficient number of nodes, how do all these entries get created?
I'm just beginning to take a look at the FreePastry implementation to get these answers, but it doesn't seem very apparent at the moment. If anyone could provide some pointers here, that'd be of great help too.

My understanding of Pastry is not complete, by any stretch of the imagination, but it was enough to build a more-or-less working version of the algorithm. Which is to say, as far as I can tell, my implementation functions properly.
To answer your first question:
The protocol says that this [first] node should have had its routing tables
set up at this time. But I do not have any other nodes in the ring at
this point, so how does it even begin to create its routing tables?
I solved this problem by first creating the Node and its state/routing tables. The routing tables, when you think about it, are just information about the other nodes in the network. Because this is the only node in the network, the routing tables are empty. I assume you have some way of creating empty routing tables?
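For what it's worth, here is a minimal sketch in Java of what that first step can look like in a single-JVM simulation (the class and field names are my own, not FreePastry's):

    import java.util.ArrayList;
    import java.util.List;

    /** A bare-bones Pastry-style node for a single-JVM simulation; the very
     *  first node in the ring simply starts with empty state tables. */
    public class PastryNode {
        // 160-bit nodeId (e.g. a SHA-1 hash), kept as 40 hex digits for simplicity
        final String nodeId;

        // Routing table: one row per hex digit of the id, one column per digit value.
        // Every slot is null until other nodes are learned about.
        final PastryNode[][] routingTable = new PastryNode[40][16];

        // Leaf set and neighborhood set also start out empty.
        final List<PastryNode> leafSet = new ArrayList<>();
        final List<PastryNode> neighborhoodSet = new ArrayList<>();

        PastryNode(String nodeId) {
            this.nodeId = nodeId;
            // Nothing else to do: with no other node in the ring,
            // there is simply nothing to put into the tables yet.
        }
    }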
To answer your second question:
When the 2nd node wishes to join the ring, it sends a Join
message (containing its nodeId) to the 1st node, which passes it along
in hops to the closest available neighbor of this 2nd node already
existing in the ring. These hops contribute to the creation of routing
table entries for this new 2nd node. Again, in the absence of a
sufficient number of nodes, how do all these entries get created?
You should take another look at the paper (PDF warning!) that describes Pastry; it does a rather good job of explaining the process for nodes joining and exiting the cluster.
If memory serves, the second node sends a message that not only contains its node ID, but actually uses its node ID as the message's key. The message is routed like any other message in the network, which ensures that it quickly winds up at the node whose ID is closest to the ID of the newly joined node. Every node that the message passes through sends their state tables to the newly joined node, which it uses to populate its state tables. The paper explains some in-depth logic that takes the origin of the information into consideration when using it to populate the state tables in a way that, I believe, is intended to reduce the computational cost, but in my implementation, I ignored that, as it would have been more expensive to implement, not less.
To answer your question specifically, however: the second node will send a Join message to the first node. The first node will send its state tables (empty) to the second node. The second node will add the sender of the state tables (the first node) to its state tables, then add the appropriate nodes in the received state tables to its own state tables (no nodes, in this case). The first node would forward the message on to a node whose ID is closer to the second node's, but no such node exists, so the message is considered "delivered", and both nodes are considered to be participating in the network at this time.
Should a third node join and route a Join message to the second node, the second node would send the third node its state tables. Then, assuming the third node's ID is closer to the first node's, the second node would forward the message to the first node, who would send the third node its state tables. The third node would build its state tables out of these received state tables, and at that point it is considered to be participating in the network.
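To make that walk-through concrete, here is a rough sketch of how the join handling might be implemented on top of the PastryNode class sketched earlier. The method names are my own, and the real Pastry rules (which routing-table rows to copy from which hop, leaf-set size limits, proximity metrics) are simplified down to a single "known nodes" merge:

    /** Handle a Join message routed towards this node on behalf of 'joiner'. */
    void onJoin(PastryNode joiner) {
        // Every node the Join passes through sends its state tables to the joiner.
        joiner.mergeStateTablesFrom(this);

        // Forward to the node we know of that is numerically closest to the joiner,
        // but only if it is closer than we are. If no such node exists (e.g. we are
        // the only node in the ring), the Join is considered "delivered" here.
        PastryNode next = null;
        for (PastryNode candidate : leafSet) {
            if (candidate == joiner) {
                continue;
            }
            if (next == null || distance(candidate, joiner).compareTo(distance(next, joiner)) < 0) {
                next = candidate;
            }
        }
        if (next != null && distance(next, joiner).compareTo(distance(this, joiner)) < 0) {
            next.onJoin(joiner);
        }

        // In the real protocol the new node sends its resulting state back to the
        // nodes it learned about; here we just record the joiner directly.
        if (joiner != this && !leafSet.contains(joiner)) {
            leafSet.add(joiner);
        }
    }

    /** Record the sender, and everything it knows about, in our own tables. */
    void mergeStateTablesFrom(PastryNode sender) {
        if (sender != this && !leafSet.contains(sender)) {
            leafSet.add(sender);
        }
        for (PastryNode known : sender.leafSet) {
            if (known != this && !leafSet.contains(known)) {
                leafSet.add(known);
            }
        }
    }

    private static java.math.BigInteger distance(PastryNode a, PastryNode b) {
        return new java.math.BigInteger(a.nodeId, 16)
                .subtract(new java.math.BigInteger(b.nodeId, 16)).abs();
    }

With this, the first and second nodes end up knowing about each other after the first Join, and a third node receives the merged tables of every hop its Join message passes through, which is exactly the behaviour described above.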
Hope that helps.

Related

What happens if a node in a Cassandra database fails while transferring data to the client?

Let's say we have a Cassandra cluster of 6 nodes with RF=3, and we query a particular node to extract some data, but the node fails while processing or transferring that data. What are the possible outcomes in the following scenarios?
Let's say it's reading the required data from disk and the node dies in the process. Will the coordinator (the node that received our request) resend the request to one of the replica nodes, or just return an error to the client?
Let's say the node died while it was transferring data. Will the coordinator return partial data, or will it realise that the information is incomplete and resend the request to a different node (a replica)?
In either case, as a programmer, do we have to explicitly code for these conditions, or is it all taken care of internally by the Cassandra server?
Thanks in advance.
P.S: I am sorry if a similar question has been asked before. I did try searching but I couldn't find it.
One of the most important concepts to understand in Cassandra is its variable "Consistency Level", or CL. Perhaps the most common setting is CL=QUORUM, which means that with RF=3 (each piece of data is replicated on 3 nodes), Cassandra will require two successful responses from two replicas before returning a result to the client.
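To illustrate, the client asks for this consistency level per statement (or per session); here is a minimal sketch using the DataStax Java driver 4.x, where the keyspace, table and query are placeholders:

    import com.datastax.oss.driver.api.core.CqlSession;
    import com.datastax.oss.driver.api.core.DefaultConsistencyLevel;
    import com.datastax.oss.driver.api.core.cql.SimpleStatement;

    public class QuorumReadExample {
        public static void main(String[] args) {
            // Connects to localhost:9042 by default; configure contact points as needed.
            try (CqlSession session = CqlSession.builder().build()) {
                // With RF=3 and QUORUM, the coordinator waits for 2 replicas, so a
                // single node failing mid-request is handled transparently.
                SimpleStatement stmt = SimpleStatement
                        .newInstance("SELECT * FROM my_keyspace.my_table WHERE id = ?", 42)
                        .setConsistencyLevel(DefaultConsistencyLevel.QUORUM);
                session.execute(stmt).forEach(row -> System.out.println(row));
            }
        }
    }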
In a request for a particular partition, the coordinator will start out by sending the client's request to 2 of the 3 replicas known to hold the partition. Cassandra keeps an estimate of the average response latency, and if that estimated time passes without a response, it sends a third request to the third replica. Such a timeout will happen in the cases you mentioned - if a response doesn't complete quickly (it doesn't matter if it partially completed), the third request is sent. Unless two nodes are down at the same time, you will get your complete response, and the client doesn't need to take care of anything. This is the "high availability" feature that Cassandra and other NoSQL databases are famous for.
Note that this is true even for extremely long responses (scanning an entire table, or fetching a very long partition). Such long responses are broken up into "pages" of reasonable length; each page is fetched in a separate request and can come from any 2 of the 3 replicas, not necessarily the same ones.
Everything I wrote above applies to Scylla as well as Cassandra.

Why can't Cassandra survive the loss of any node without data loss, with replication factor 2?

Hi, I was trying out different configurations using the site
https://www.ecyrd.com/cassandracalculator/
But I could not understand the following result shown for this configuration:
Cluster size 3
Replication Factor 2
Write Level 1
Read Level 1
You can survive the loss of no nodes without data loss.
For reference, I have seen the question Cassandra loss of a node.
But it still does not help me understand why write level 1 with replication factor 2 would make my Cassandra cluster unable to survive the loss of even one node without data loss.
A write request goes to all replica nodes, and even if only 1 responds it is a success. So, assuming 1 node is down, all write requests will go to the other replica node and return success, and the data will become eventually consistent.
Can someone help me understand with an example.
I guess what the calculator is working with is the worst case scenario.
You can survive the loss of one node if your data is available redundantly on two out of three nodes. The thing with write level ONE is that there is no guarantee that the data is actually present on two nodes right after your write was acknowledged.
Let's assume the coordinator of your write is one of the nodes holding a copy of the record you are writing. With write level ONE you are telling the cluster to acknowledge your write as soon as it has been committed to one of the two nodes that should hold the data. The coordinator might do that before even attempting to contact the other node (to improve the latency perceived by the client). If, right after acknowledging the write but before attempting to contact the second node, the coordinator node goes down and cannot be brought back, then you have lost that write and the data with it.
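Put differently, the calculator appears to apply a worst-case rule along these lines (my own sketch of the rule, not the calculator's actual code):

    /** Worst case: data is only guaranteed on 'writeLevel' replicas at the moment
     *  a write is acknowledged, so that is all you can afford to lose. */
    static int nodesSurvivableWithoutDataLoss(int replicationFactor, int writeLevel) {
        return Math.min(writeLevel, replicationFactor) - 1;
    }

    // RF=2, write level 1          -> 0 nodes (the calculator's answer above)
    // RF=2, write level 2          -> 1 node
    // RF=3, write level 2 (QUORUM) -> 1 node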
When you read or write data, Cassandra computes the hash token for the data and routes the request to the responsible nodes. When you have a 3-node cluster with a replication factor of 2, your data is stored on 2 nodes. So if the 2 nodes responsible for a token A are both down and that token is not replicated on node 3, then even though you still have one node up, you will get a TokenRangeOfflineException.
The point is that we need the replicas (tokens) to be available, not just any nodes. Also see the similar question answered here.
This is the case because the write level is 1. If your application writes to only 1 node (and waits for the data to become eventually consistent/synced, which takes non-zero time), then data can get lost if that one server itself is lost before the sync could happen.

Kademlia closest good nodes don't intersect enough between two requests

Working on a BEP44 implementation, I use the defined Kademlia algorithm to find the closest good nodes for a given hash ID.
Using my program I run go run main.go -put "Hello World!" -kname mykey -salt foobar2 -b public and get the value stored on over a hundred nodes (good).
Now, when I run it multiple consecutive times, the sets of IPs written to by the put requests intersect poorly.
This is a problem: when I then do a get request, the set of IPs queried does not intersect with the put set, so the value is not found.
In my tests I use the public DHT bootstrap nodes
"router.utorrent.com:6881",
"router.bittorrent.com:6881",
"dht.transmissionbt.com:6881",
When I query the nodes, I select the 8 closest nodes (nodes := s.ClosestGoodNodes(8, msg.InfoHash())), which usually ends up in a list of ~1K queries after a recursive traversal.
In my understanding, the set of nodes storing the addresses for an info hash in the DHT is deterministic given the state of the routing tables. As I am doing consecutive queries I expect the tables to change, indeed, but not that much.
How can it happen that the sets of store nodes do not intersect?
Since BEP44 is an extension it is only supported by a subset of the DHT nodes, which means the iterative lookup mechanism needs to take support into account when determining whether the set of closest nodes is stable and the lookup can be terminated.
If a node returns a token, v or seq field in a get response then it is eligible for the closest-set of a read-only get.
If a node returns a token then it is eligible for the closest-set for a get that will be followed by put operation.
So your lookup may home in on a set of nodes in the keyspace that is closest to the target ID but not eligible for the operations in question. As long as you have candidates that are closer than the best known eligible contacts you have to continue searching. I call this perimeter widening, as it conceptually broadens the search area around the target.
Additionally you also need to take error responses or the absence of a response into account when performing put requests. You can either retry the node or try the next eligible node instead.
I have written down some additional constraints that one might want to put on the closest set in lookups for robustness and security reasons in the documentation of my own DHT implementation.
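To illustrate the eligibility rules above, the check during the iterative lookup might look roughly like this (a sketch with a made-up response type, not tied to any particular DHT library):

    /** Hypothetical view of a decoded BEP44 "get" response from one node. */
    class GetResponse {
        String token;   // write token, required before a later put
        byte[] value;   // the "v" field, present if the node stores the value
        Long seq;       // the "seq" field for mutable items

        /** Is the responding node eligible for the lookup's closest-set? */
        boolean eligibleFor(boolean lookupWillBeFollowedByPut) {
            if (lookupWillBeFollowedByPut) {
                // For a put we must be able to write, so a token is required.
                return token != null;
            }
            // For a read-only get, any BEP44-aware reply (token, v or seq) counts.
            return token != null || value != null || seq != null;
        }
    }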
which usually ends up in a list of ~1K queries after a recursive traversal.
This suggests something is wrong with your lookup algorithm. In my experience a lookup should only take somewhere between 60 and 200 UDP requests to find its target if you're doing a lookup with concurrent requests, maybe even fewer when it is sequential.
Verbose logs of the terminal sets, to eyeball how the lookups make progress and how much junk I am getting from peers, have served me well.
In my tests I use the public DHT bootstrap nodes
You should write your routing table to disk and reload it from there, and only perform bootstrapping when none of the persisted nodes in your RT are reachable. Otherwise you are wasting the bootstrap nodes' resources and also wasting time by having to re-populate your routing table before performing any lookup.
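A sketch of that bootstrap strategy (the helper methods here are hypothetical placeholders for whatever persistence and ping machinery your implementation has):

    // On startup: prefer nodes persisted from the previous run.
    List<InetSocketAddress> persisted = loadRoutingTableFromDisk();   // hypothetical helper
    List<InetSocketAddress> reachable = pingAll(persisted);           // hypothetical helper

    if (reachable.isEmpty()) {
        // Only fall back to the public bootstrap nodes as a last resort.
        reachable = pingAll(List.of(
                new InetSocketAddress("router.bittorrent.com", 6881),
                new InetSocketAddress("dht.transmissionbt.com", 6881)));
    }
    populateRoutingTable(reachable);                                  // hypothetical helper

    // On shutdown (and periodically): persist the current routing table.
    saveRoutingTableToDisk(currentRoutingTableEntries());             // hypothetical helpers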

How does the 'distributed tracker' concept work in the BitTorrent DHT?

I have read the Kademlia spec and the DHT BEP for BitTorrent but still can't understand how the DHT makes trackerless torrents reliable.
My understanding of the routing procedure is:
Node (say A) picks the node with the id closest to the torrent's infohash from its routing table (say B) and sends a find_peers query to it
If B doesn't have information about peers, it sends back the addresses of nodes with ids closer to the infohash
Node A performs iterative routing until it reaches a node (say X) that responds with the addresses of seeding peers
When node A starts the download process, it announces itself to node X
But what happens when node X vanishes from the swarm? Is there any failover? How is tracking information distributed across the nodes in the swarm?
First of all, the DHT is a global overlay shared between all bittorrent clients, it's not specific to individual swarms.
Second, straight from the paper, section 2.3:
To store a (key,value) pair, a participant locates the k closest nodes
to the key and sends them STORE RPCs. Additionally, each node
re-publishes (key,value) pairs as necessary to keep them alive, as
described later in Section 2.5. This ensures persistence (as we show
in our proof sketch) of the (key,value) pair with very high
probability.
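In code terms, the store step from that quoted paragraph amounts to something like the following sketch (the lookup, RPC and scheduling calls are placeholders, not any specific client's API):

    static final int K = 8;  // Kademlia's replication parameter "k"

    void store(byte[] key, byte[] value) {
        // 1. Iterative lookup for the k nodes whose ids are closest to the key.
        List<NodeContact> closest = iterativeFindNode(key, K);        // placeholder

        // 2. Send a STORE RPC to each of them; the mapping now lives on k nodes,
        //    so any single one of them vanishing does not lose it.
        for (NodeContact node : closest) {
            sendStoreRpc(node, key, value);                           // placeholder
        }

        // 3. Re-publish periodically so the mapping migrates to whichever nodes
        //    are currently closest to the key as the network churns.
        scheduleRepublish(key, value);                                // placeholder
    }

In the BitTorrent DHT the same idea applies to peer announcements: your node X is just one of the several closest nodes holding the peer list, and clients periodically re-announce, so a single node vanishing is absorbed by the remaining replicas and by the next re-announce.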

How does a Hazelcast request get routed to the right partition?

I am a beginner with Hazelcast and trying to understand the following.
In a normal peer-to-peer set-up with a 3-node cluster, each node being an individual partition: on a request, how is the right partition picked? Is there a router involved in every request? How is the request served?
Thanks
Hazelcast doesn't use consistent hashing so the answer given by Jeremie B is not exactly accurate.
There's a couple of important concepts in Hazelcast:
Partitions - by default there are 271 partitions, which are evenly spread among the nodes. Each node owns "primary" partitions and holds "backup" partitions.
Hash function - maps a key to a partition; in a simplified version it looks like this: hash(key) % partitionCount = partition
Partition table - keeps the mapping between partitions and nodes, or to be more precise between partitions and replicas. The first replica of each partition is the "primary" partition; the second, third, ... are the backups.
In order to contact the right node:
a "smart" client keeps track of the "Partition Table".
it uses the hashing algorithm to calculate the partition where the key is stored.
it looks up that partition in the "Partition Table" and connects to the node that contains the given replica.
There's also the concept of a dummy client, which doesn't know which node it should connect to. Requests issued by a dummy client are routed to the right node by whichever node it connects to (unless, by coincidence, it happens to connect to the right node already).
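Going back to the smart client, here is a simplified sketch of what it does with the partition table (the hash below is a stand-in for Hazelcast's real partitioning hash, which is computed over the serialized key):

    import java.util.Map;

    public class SmartClientSketch {
        static final int PARTITION_COUNT = 271;   // Hazelcast's default

        // Partition table: partitionId -> address of the member owning the primary replica
        private final Map<Integer, String> partitionTable;

        SmartClientSketch(Map<Integer, String> partitionTable) {
            this.partitionTable = partitionTable;
        }

        /** Which member should receive the operation for this key? */
        String routeFor(Object key) {
            int partitionId = Math.floorMod(key.hashCode(), PARTITION_COUNT); // simplified hash
            return partitionTable.get(partitionId);                           // look up the owner
        }
    }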
The core of Hazelcast is based on a "Distributed Hash Table", without a master node. It works with two pieces of knowledge shared between the nodes:
An ordered list of the nodes participating in the cluster
A hash function
For 1/, Hazelcast uses the list of nodes ordered from the oldest to the youngest. This information is "easy" to get and doesn't need to be synchronized through an election. 2/ is just some code/configuration.
The principle of the DHT is simple: imagine you have three nodes, ordered A, B and C. If you want to know which node is responsible for a key K, you simply hash the key and take this value modulo 3. If you get 0, it's node A; if you get 1, it's node B; and if 2, it's node C.
Of course, this is only a simplified view of Hazelcast: for example, each structure is split into X partitions, and each node owns more than one partition. Moreover, each partition is replicated, so for each partition there is one "master" node and several "backup" nodes. But you get the point: no master node, no routing node, every node "knows" where each piece of data belongs.
