While writing another DHT-ready torrent client, I ran into a question about announcing on the DHT. It is clear that I have to send get_peers to nodes closer and closer to the searched info hash until at least one node responds with a list of peers for that info hash.
As I understand it, I can find multiple nodes responding with overlapping lists of peers for the info hash. Now the question is: should I announce my presence to all nodes that returned a list of peers, or pick just one? What are the recommendations here?
Or maybe I'm mistaken, it works somewhat differently, and it is not actually possible for multiple nodes to hold lists of peers for the same info hash?
It is clear that I have to send get_peers to nodes closer and closer to the searched info hash until at least one node responds with a list of peers for that info hash.
You actually perform an iterative lookup until responses do not return any new node contact information that is closer to the target key than the K closest entries that have responded and included a write token.
Now the question is: should I announce my presence to all nodes that returned a list of peers, or pick just one?
Only to the K-closest-node set. If an announce fails (error message or no response), you can also backtrack to ensure you get at least K store requests acknowledged.
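Putting both answers together, here is a minimal sketch of that flow. This is only an illustration of the control flow, not Mainline-exact wire handling: nodes are (node_id, address) tuples, and send_get_peers / send_announce_peer stand in for whatever transport layer you have (both names are assumptions).

import heapq

K = 8  # Mainline uses a bucket size / closest-set size of 8

def xor_distance(a: bytes, b: bytes) -> int:
    """XOR metric over 20-byte node IDs / info hashes."""
    return int.from_bytes(a, "big") ^ int.from_bytes(b, "big")

def lookup_and_announce(info_hash, bootstrap, send_get_peers, send_announce_peer, port):
    """Iterative get_peers lookup, then announce_peer to the K closest
    nodes that responded with a write token.

    send_get_peers(node, info_hash) -> (peers, closer_nodes, token) or None
    send_announce_peer(node, info_hash, port, token) -> bool
    """
    candidates, queried = list(bootstrap), set()
    responded, peers = [], set()          # responded: (node, token) pairs

    while candidates:
        # always query the closest candidate we have not asked yet
        candidates.sort(key=lambda n: xor_distance(n[0], info_hash))
        node = candidates.pop(0)
        if node in queried:
            continue
        queried.add(node)

        reply = send_get_peers(node, info_hash)
        if reply is None:                 # timeout or error: skip this node
            continue
        found, closer, token = reply
        peers.update(found)
        responded.append((node, token))
        candidates.extend(n for n in closer if n not in queried)

        # stop once no remaining candidate is closer than the K-th
        # closest node that has already answered with a token
        if len(responded) >= K:
            kth = heapq.nsmallest(
                K, responded, key=lambda r: xor_distance(r[0][0], info_hash))[-1]
            if all(xor_distance(c[0], info_hash) >= xor_distance(kth[0][0], info_hash)
                   for c in candidates):
                break

    # announce only to the K closest token-bearing responders
    closest = heapq.nsmallest(
        K, responded, key=lambda r: xor_distance(r[0][0], info_hash))
    for node, token in closest:
        send_announce_peer(node, info_hash, port, token)
    return peers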
I am working on Mainline DHT and I don't understand one nuance.
Here, https://www.bittorrent.org/beps/bep_0005.html says: "A "distance metric" is used to compare two node IDs or a node ID and an info hash for "closeness.""
It also says: "Announce that the peer, controlling the querying node, is downloading a torrent on a port. announce_peer has four arguments: "id" containing the node ID of the querying node, "info_hash" containing the info hash of the torrent, "port" containing the port as an integer, and the "token" received in response to a previous get_peers query."
So, for example, we have a peer with ID 223456789zxcvbnmasdf, whose IP is 86.23.145.714 and whose port is 7853.
I know that this peer has downloaded 2 torrents with info hashes 023456789zxcvbnmasdg and 123456789zxcvbnmasdh.
So how exactly should my k-bucket record look?
Like this:
{"id": "223456789zxcvbnmasdf", "ip": "86.23.145.714", "port": "7853", "torrents": ["023456789zxcvbnmasdg", "123456789zxcvbnmasdh"]} ?
Or should the torrents appear as "equivalent" records (with duplicated IPs and ports) in the k-buckets alongside the peer:
{"id": "223456789zxcvbnmasdf", "ip": "86.23.145.714", "port": "7853"},
{"id": "023456789zxcvbnmasdg", "ip": "86.23.145.714", "port": "7853"},
{"id": "123456789zxcvbnmasdh", "ip": "86.23.145.714", "port": "7853"} ?
I am asking because this is not just an implementation nuance: "k" is normally 20 or some other integer in all clients, so if I used the k-buckets to store torrents as full-fledged members, I would have less space to store real peer data.
Thanks for answers!
How you structure your data internally is up to you. All it has to do is fulfill the contract of the specification. In principle one could associate torrents with buckets based on the xor distance – e.g. for resource accounting reasons – but most implementations keep the routing table and the storage separate.
The primary routing table only contains nodes, structural members of the DHT overlay itself. Torrents, on the other hand, are not part of the overlay; they're data stored on top of the overlay, the hash table abstraction. Hence the name, Distributed Hash Table. I.e. they exist at different abstraction levels.
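A minimal sketch of that separation (all names here are illustrative, not taken from any particular client):

import time

class DhtNode:
    """Keeps the overlay's routing table and the stored torrent data apart."""

    def __init__(self, node_id: bytes):
        self.node_id = node_id
        # Routing table: other *nodes* of the overlay, organised into
        # 160 k-buckets by XOR distance to our own ID.
        self.buckets = [[] for _ in range(160)]   # entries: (node_id, ip, port)
        # Storage: data kept *on top of* the overlay,
        # info_hash -> {(peer_ip, peer_port): last_announce_time}
        self.peer_store = {}

    def on_announce_peer(self, info_hash: bytes, ip: str, port: int) -> None:
        """Record a peer for a torrent; this never touches the k-buckets."""
        self.peer_store.setdefault(info_hash, {})[(ip, port)] = time.time()

    def on_get_peers(self, info_hash: bytes) -> list:
        """Return stored peers for the torrent, if any; an empty result
        means the caller should reply with closer nodes instead."""
        return list(self.peer_store.get(info_hash, {}))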
The k-bucket data structure is an implementation detail of the Kademlia DHT, there so that peers can reply quickly enough to FIND_PEERS and FIND_VALUE queries.
What I do in my Kademlia implementation is keep the routing table in a Python dictionary and compute the nearest peers in under 5 seconds, which is the default timeout I use when waiting for a UDP reply. To achieve that I need to keep the routing table below 1,000,000 entries.
Like I said above, the routing table is a simple Python dict that maps peerid to (address, port).
The routing table stores peers, not values, i.e. not infohashes.
When I receive a FIND_PEERS message, the program replies with the following code:
async def peers(self, address, uid):
    """Remote procedure that returns peers that are near UID"""
    log.debug("[%r] find peers uid=%r from %r", self._uid, uid, address)
    # XXX: if this takes more than 5 seconds (see RPCProtocol) it
    # will time out on the other side.
    uids = nearest(self._replication, self._peers.keys(), uid)
    out = [self._peers[x] for x in uids]
    return out
When I receive a FIND_VALUE message, the program replies with the following code:
async def value(self, address, key):
    """Remote procedure that returns the associated value or peers that
    are near KEY"""
    log.debug("[%r] value key=%r from %r", self._uid, key, address)
    out = await lookup(key)
    if out is None:
        # value is not found, reply with peers that are near KEY
        out = nearest(self._replication, self._peers.keys(), key)
        return (b"PEERS", out)
    else:
        return (b"VALUE", out)
Here is the definition of nearest:
import functools
import operator
from heapq import nsmallest

def nearest(k, peers, uid):
    """Return the K peers in PEERS nearest to UID according to XOR"""
    # XXX: It only works with len(peers) < 10^6; with more peers than
    # that, the time it takes to compute the nearest peers will exceed
    # the 5 second timeout on the other side. See RPCProtocol and
    # Peer.peers.
    return nsmallest(k, peers, key=functools.partial(operator.xor, uid))
That is, it sorts the peers by the XOR of their peerid with uid and returns the k smallest. nsmallest is essentially an optimized version of sorted(peers, key=functools.partial(operator.xor, uid))[:k], where uid is a peerid or an infohash (for FIND_PEERS and FIND_VALUE respectively).
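For example, a quick sanity check of that behaviour with small integer IDs (real peerids and infohashes are 160-bit integers, but the mechanics are the same):

import functools
import operator
from heapq import nsmallest

peers = [0b0001, 0b0100, 0b0111, 0b1000]
uid = 0b0101
print(nsmallest(2, peers, key=functools.partial(operator.xor, uid)))
# -> [4, 7]: 0b0100 and 0b0111 have the smallest XOR distance to 0b0101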
Now back to your question(s):
Is the infohash equivalent to the peer ID in the Mainline DHT?
An infohash is the same kind of hash as a peerid, i.e. they live in the same keyspace. That is, torrents are associated with a hash (the infohash), and peers are associated with the same kind of hash, called a peerid. Peers have "ownership" of the keys that are near their peerid. BUT infohashes are not stored in the routing table (or k-buckets, if you prefer); infohashes are keys in another mapping that associates each infohash with its value(s).
I am asking because this is not just an implementation nuance: "k" is normally 20 or some other integer in all clients, so if I used the k-buckets to store torrents as full-fledged members, I would have less space to store real peer data.
There is a misunderstanding here about the same thing I tried to explain above: infohashes are keys in the storage dictionary, whereas peerids are keys in the routing table, a.k.a. the k-bucket data structure. They both have the same format because that is how the Kademlia routing algorithm works: you must be able to compare an infohash with a peerid using XOR to tell which peers "own" which infohash value.
As you can see in the second snippet, when another peer asks for the value associated with a hash, the node calls lookup(key), which is something like storage.get(key), except that in my case the code stores values in a database.
Maybe there is a misunderstanding about the fact that k-buckets are used to store DHT routing information, and that, on top of that, the torrent protocol uses the DHT to store torrent routing information.
For what it is worth, qadom's peer.py file is where I implement a DHT inspired by Kademlia (except that I use 256-bit hashes, forgo the alpha and k parameters, and use a single REPLICATION parameter). The code works most of the time; check the tests.
Also, there is another project I took inspiration from, simply called kademlia, which implements (or tries to implement?) k-buckets.
As far as I understand, torrent DHT announces look like qadom's bag functions, except that the receiving peer must authenticate the announcement via the write token, whereas in qadom the bags are free-for-all.
Also, check the original Kademlia paper.
I'm working on a side project and trying to monitor peers on popular torrents, but I can't see how I can get a hold of the full dataset.
If the theoretical limit on routing table size is 1,280 (from 160 buckets * bucket size k = 8), then I'm never going to be able to hold the full number of peers on a popular torrent (~9,000 on a current top-100 torrent).
My concern with simulating multiple nodes is low efficiency due to overlapping values. I would assume that their bootstrapping paths being similar would result in similar routing tables.
Your approach is wrong, since it would violate the reliability goals of the DHT: you would essentially be performing an attack on that keyspace region, other nodes may detect and blacklist you, and it would also simply be bad-mannered.
If you want to monitor specific swarms, don't collect data passively from the DHT. Instead:
- if the torrents have trackers, just contact them to get peer lists
- connect to the swarm and get peer lists via PEX, which provides far more accurate information than the DHT
- if you really want to use the DHT, perform active lookups (get_peers) at regular intervals, as sketched below
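For the last option, a rough sketch of such a monitor; dht.get_peers(info_hash) stands in for whatever iterative-lookup coroutine your DHT library exposes (the name and the interval are assumptions, not a real API):

import asyncio

LOOKUP_INTERVAL = 15 * 60   # seconds between active lookups per torrent

async def monitor_swarm(dht, info_hash: bytes, on_new_peers):
    """Repeatedly run an active get_peers lookup and report newly seen
    peers; repeated lookups approximate the full peer set, since no
    single DHT node ever stores all of it."""
    seen = set()
    while True:
        peers = await dht.get_peers(info_hash)   # iterative lookup
        new = set(peers) - seen
        seen |= new
        if new:
            on_new_peers(info_hash, new)
        await asyncio.sleep(LOOKUP_INTERVAL)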
I think I'm missing something here or confusing terms perhaps.
What happens to the key:value pairs stored at a peer in the overlay DHT when that peer leaves the p2p network? Are they moved to the new appropriate nearest successor? Is there a standard mechanism for this, if that is the case?
My understanding is that the successor and predecessor information of adjacent peers has to be updated when a peer leaves; however, I can't seem to find information on what happens to the actual data stored at that peer. How is the data kept complete in the DHT as peer churn occurs?
Thank you.
This usually is not part of the abstract routing algorithm that's at the core of a DHT but implementation-specific behavior instead.
Usually you will want to store the data on multiple nodes neighboring the target key; that way you get some redundancy to handle node failures.
To keep the data alive you can either have the originating node republish it at regular intervals or have the storage nodes replicate it among each other. The latter causes a bit less traffic if done properly, but is more complicated to implement.
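A rough sketch of the first option, with a matching expiry on the storing side; the dht.store call, the interval, and the lifetime values are assumptions, not part of any specification:

import asyncio
import time

REPUBLISH_INTERVAL = 30 * 60   # originator re-stores every 30 minutes
ENTRY_LIFETIME = 60 * 60       # storing nodes drop entries after 1 hour

async def keep_alive(dht, key: bytes, value: bytes):
    """Originating node: periodically re-store the pair on whatever
    nodes are currently the K closest to the key."""
    while True:
        await dht.store(key, value)
        await asyncio.sleep(REPUBLISH_INTERVAL)

def expire_stale(store: dict) -> None:
    """Storing node: drop entries that were not refreshed in time, so
    data whose originator has left eventually disappears."""
    now = time.time()
    for key, (value, stored_at) in list(store.items()):
        if now - stored_at > ENTRY_LIFETIME:
            del store[key]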
I have read the Kademlia spec and the DHT BEP for BitTorrent but still can't understand how the DHT makes trackerless torrents reliable.
My understanding of the routing procedure is:
1. Node A picks the node in its routing table whose ID is closest to the torrent's infohash (say B) and sends a get_peers query to it.
2. If B doesn't have information about peers, it returns the addresses of nodes whose IDs are closer to the infohash.
3. Node A continues this iterative routing until it reaches a node (say X) that responds with the addresses of peers in the swarm.
4. When node A starts the download, it announces itself to node X.
But what happens when node X vanishes from the swarm? Is there any failover? How is tracking information distributed across the nodes in the swarm?
First of all, the DHT is a global overlay shared between all BitTorrent clients; it's not specific to individual swarms.
Second, straight from the paper, section 2.3:
To store a (key,value) pair, a participant locates the k closest nodes
to the key and sends them STORE RPCs. Additionally, each node
re-publishes (key,value) pairs as necessary to keep them alive, as
described later in Section 2.5. This ensures persistence (as we show
in our proof sketch) of the (key,value) pair with very high
probability.
This question is about the routing table creation at a node in a p2p network based on Pastry.
I'm trying to simulate this scheme of routing table creation in a single JVM. I can't seem to understand how these routing tables are created from the point where the first node joins.
I have N independent nodes, each with a 160-bit nodeId generated as a SHA-1 hash, and a function to determine the proximity between these nodes. Let's say the 1st node starts the ring and joins it. The protocol says that this node should have had its routing tables set up at this time. But I do not have any other nodes in the ring at this point, so how does it even begin to create its routing tables?
When the 2nd node wishes to join the ring, it sends a Join message (containing its nodeId) to the 1st node, which passes it along in hops toward the node already in the ring that is closest to the 2nd node. These hops contribute the routing table entries for the new 2nd node. Again, in the absence of a sufficient number of nodes, how do all these entries get created?
I'm just beginning to take a look at the FreePastry implementation to get these answers, but it doesn't seem very apparent at the moment. If anyone could provide some pointers here, that'd be of great help too.
My understanding of Pastry is not complete, by any stretch of the imagination, but it was enough to build a more-or-less working version of the algorithm. Which is to say, as far as I can tell, my implementation functions properly.
To answer your first question:
The protocol says that this [first] node should have had its routing tables
set up at this time. But I do not have any other nodes in the ring at
this point, so how does it even begin to create its routing tables?
I solved this problem by first creating the Node and its state/routing tables. The routing tables, when you think about it, are just information about the other nodes in the network. Because this is the only node in the network, the routing tables are empty. I assume you have some way of creating empty routing tables?
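Concretely, a minimal sketch of that starting state, using the 160-bit SHA-1 IDs from your question with base-16 digits (b = 4); the class and field names are illustrative, not from FreePastry:

class PastryNode:
    """State of a node that has just started the ring on its own."""

    B = 4                        # digits are base 2**B = 16 (hex)
    ROWS = 160 // B              # 160-bit SHA-1 IDs -> 40 routing table rows
    COLS = 2 ** B                # 16 columns per row
    LEAF_SET_SIZE = 16           # L entries, half below and half above our ID
    NEIGHBORHOOD_SIZE = 16       # M closest nodes by the proximity metric

    def __init__(self, node_id: str):
        self.node_id = node_id   # 40 hex digits
        # All tables describe *other* nodes, so for the first (and only)
        # node in the ring every slot simply stays empty.
        self.routing_table = [[None] * self.COLS for _ in range(self.ROWS)]
        self.leaf_set = []
        self.neighborhood_set = []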
To answer your second question:
When the 2nd node wishes to join the ring, it sends a Join
message (containing its nodeId) to the 1st node, which passes it
along in hops toward the node already in the ring that is closest
to the 2nd node. These hops contribute the routing table entries
for the new 2nd node. Again, in the absence of a sufficient number
of nodes, how do all these entries get created?
You should take another look at the paper (PDF warning!) that describes Pastry; it does a rather good job of explaining the process for nodes joining and exiting the cluster.
If memory serves, the second node sends a message that not only contains its node ID, but actually uses its node ID as the message's key. The message is routed like any other message in the network, which ensures that it quickly winds up at the node whose ID is closest to the ID of the newly joined node. Every node that the message passes through sends its state tables to the newly joined node, which the new node uses to populate its own state tables. The paper explains some in-depth logic that takes the origin of the information into consideration when using it to populate the state tables in a way that, I believe, is intended to reduce the computational cost, but in my implementation, I ignored that, as it would have been more expensive to implement, not less.
To answer your question specifically, however: the second node will send a Join message to the first node. The first node will send its state tables (empty) to the second node. The second node will add the sender of the state tables (the first node) to its state tables, then add the appropriate nodes in the received state tables to its own state tables (no nodes, in this case). The first node would forward the message on to a node whose ID is closer to the second node's, but no such node exists, so the message is considered "delivered", and both nodes are considered to be participating in the network at this time.
Should a third node join and route a Join message to the second node, the second node would send the third node its state tables. Then, assuming the third node's ID is closer to the first node's, the second node would forward the message to the first node, who would send the third node its state tables. The third node would build its state tables out of these received state tables, and at that point it is considered to be participating in the network.
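A rough sketch of that join handling from the receiving node's point of view; next_hop (the standard Pastry prefix-routing step) and send are assumed to be provided by the surrounding implementation, and the message shapes are illustrative:

def handle_join(my_id, my_state_tables, join_msg, next_hop, send):
    """Handle a Join message that is keyed on the joining node's ID.

    next_hop(key) -> ID of a strictly closer node we know of, or None
    send(dest_id, message) -> delivers a message to another node
    """
    joiner = join_msg["node_id"]
    # Every node along the route hands its state tables to the joiner,
    # which uses them to populate its own tables.
    send(joiner, {"type": "state", "from": my_id, "tables": my_state_tables})
    nxt = next_hop(joiner)
    if nxt is None:
        # No closer node exists: the Join is considered "delivered" and
        # the joiner is now a participant in the network.
        send(joiner, {"type": "join-done", "from": my_id})
    else:
        # Otherwise forward the Join like any other routed message.
        send(nxt, join_msg)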
Hope that helps.