How can I increase the number of peers in my routing table associated with a given infohash?

I'm working on a side project and trying to monitor peers on popular torrents, but I can't see how I can get a hold of the full dataset.
If the theoretical limit on routing table size is 1,280 entries (160 buckets * bucket size k = 8), then I'm never going to be able to hold the full set of peers on a popular torrent (~9,000 on a current top-100 torrent).
My concern with simulating multiple nodes is low efficiency due to overlapping values. I would assume that their bootstrapping paths being similar would result in similar routing tables.

Your approach is wrong: it would violate the reliability goals of the DHT. You would essentially be performing an attack on that keyspace region, other nodes may detect and blacklist you, and it would also simply be bad-mannered.
If you want to monitor specific swarms, don't collect data passively from the DHT. Instead:
If the torrents have trackers, just contact them to get peer lists.
Connect to the swarm and get peer lists via PEX, which provides far more accurate information than the DHT.
If you really want to use the DHT, perform active lookups (get_peers) at regular intervals (see the sketch below).
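For the last option, here is a minimal sketch of what a periodic get_peers poll looks like at the KRPC level (BEP 5). Assumptions: a single hard-coded bootstrap node, zeroed placeholder IDs you would replace with real values, and no bencode parsing of the reply; a real monitor would run a full iterative lookup rather than querying one node.

    package main

    import (
        "fmt"
        "net"
        "time"
    )

    // getPeersQuery hand-rolls the bencoding of a BEP 5 get_peers query.
    func getPeersQuery(nodeID, infohash [20]byte) []byte {
        return []byte(fmt.Sprintf(
            "d1:ad2:id20:%s9:info_hash20:%se1:q9:get_peers1:t2:aa1:y1:qe",
            nodeID[:], infohash[:]))
    }

    func main() {
        var nodeID, infohash [20]byte // placeholders: use your real node ID and target infohash
        conn, err := net.Dial("udp", "router.bittorrent.com:6881")
        if err != nil {
            panic(err)
        }
        defer conn.Close()

        buf := make([]byte, 1500)
        for {
            conn.Write(getPeersQuery(nodeID, infohash))
            conn.SetReadDeadline(time.Now().Add(5 * time.Second))
            if n, err := conn.Read(buf); err == nil {
                fmt.Printf("raw bencoded response (%d bytes): %q\n", n, buf[:n])
            }
            time.Sleep(5 * time.Minute) // re-poll the swarm at a regular interval
        }
    }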

Related

Is there a way to leverage the Bittorrent DHT for small data

I have a situation where I have a series of mostly-connected nodes that need to sync a pooled dataset. They are files of 200-1500K that update at irregular intervals of between 30 minutes and 6 hours, depending on the environment. Right now the number of nodes is in the hundreds, but ideally that will grow.
Currently I am using libtorrent to keep a series of files in sync between a cluster of nodes. I do a dump every few hours and create a new torrent based on the prior one, then associate it using the strategy of BEP 38. The infohash is then posted to a known entry in the DHT, where the other nodes poll to pick it up.
I am wondering if there is a better way to do this. The reason I originally liked BitTorrent was for firmware updates: I do not need to worry about nodes with less-than-awesome connectivity, and with the DHT it can self-assemble reasonably well. It was then extended to sync these pooled files.
I am currently trying to see if I can make an extension that would allow me to have each node do an announce_peer for each new record. Then in theory interested parties would be able to listen for that. That brings up two big issues:
How do I let the interested nodes know that there is new data?
If I have a thousand or more nodes adding new infohashes every few minutes what will that do to the DHT?
I will admit it feels like I am trying to drive a square peg into a round hole, but I really would like to keep as few protocols in play at a time.
How do I let the interested nodes know that there is new data?
You can use BEP46 to notify clients of the most recent version of a torrent.
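For context on how consumers find that notification: a BEP44 mutable item (and therefore a BEP46 pointer) lives at target = SHA1(public_key || salt), so every node that knows the publisher's key and the agreed salt polls the same spot. A tiny sketch, assuming an ed25519 key pair generated elsewhere:

    package main

    import (
        "crypto/ed25519"
        "crypto/sha1"
        "fmt"
    )

    // mutableTarget computes the DHT target of a BEP44 mutable item:
    // SHA1(public_key || salt). BEP46 stores the pointer to the newest
    // torrent version under this target.
    func mutableTarget(pub ed25519.PublicKey, salt []byte) [20]byte {
        return sha1.Sum(append(append([]byte{}, pub...), salt...))
    }

    func main() {
        pub, _, _ := ed25519.GenerateKey(nil) // in practice: the publisher's well-known key
        target := mutableTarget(pub, []byte("my-dataset"))
        fmt.Printf("poll DHT target %x for the newest version\n", target)
    }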
If I have a thousand or more nodes adding new infohashes every few minutes what will that do to the DHT?
It's hard to give a general answer here. Is each node adding a distinct dataset, or are those thousands of nodes going to participate in the same pooled data and thus more or less share one infohash? The latter should be fairly efficient, since not all of them even need to announce themselves: they could just do a read-only lookup, try to connect to the swarm, and only announce when there are not enough reachable peers. This would be similar to the put optimization for mutable items.
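A sketch of that decision logic. The hooks dhtGetPeers, tryConnect and dhtAnnounce are hypothetical placeholders for whatever client API is in use, stubbed here so the sketch compiles:

    package swarm

    // Hypothetical hooks into the DHT / peer layer; stubbed so the sketch compiles.
    var (
        dhtGetPeers = func(infohash [20]byte) []string { return nil } // read-only get_peers lookup
        tryConnect  = func(addr string) bool { return false }         // attempt a peer connection
        dhtAnnounce = func(infohash [20]byte) {}                      // announce_peer
    )

    const minReachablePeers = 5 // illustrative threshold

    // maybeAnnounce only adds this node to the DHT when the swarm looks undersupplied.
    func maybeAnnounce(infohash [20]byte) {
        reachable := 0
        for _, p := range dhtGetPeers(infohash) {
            if tryConnect(p) {
                reachable++
            }
        }
        if reachable < minReachablePeers {
            dhtAnnounce(infohash)
        }
    }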

Why Pastry DHT has an efficient routing

Recently I read some articles about the Pastry DHT. The articles said that Pastry has efficient routing. In Pastry's routing, each step's node ID shares a longer common prefix with the destination node, but node IDs are assigned randomly, so it is possible that a message travels a very long distance before it arrives at the destination, and as a result the routing is not efficient.
For example, say the destination node ID is d467c4 and the starting node ID is 65a1fc, and the route is 65a1fc -> d13da3 -> d4213f -> d462ba -> d46702 -> d467c4. It is possible that the nodes on this route are spread all over the world (IDs are assigned randomly), so the message may travel around the world before it arrives at the final node, which does not seem efficient.
So why is Pastry DHT's routing considered efficient?
That depends on your notion of efficiency. When designing overlay networks, the first concern usually is to bound the total number of hops relative to the network size. In other words, if there are n nodes you don't want O(n)-hop routes; O(log n) is the usual goal because it can be achieved without total network awareness.
Route length in terms of latency, path cost or minimum bandwidth along the links is a second-rank concern. Those properties are often achieved by adding some sort of locality-awareness or clustering after the hop count has been optimized.
Pastry is efficient for the hop metric.
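To make the hop claim concrete, a worked example assuming the standard Pastry parameterisation (128-bit IDs interpreted as base-2^b digits, b = 4 by default):

    \text{expected hops} \approx \lceil \log_{2^b} N \rceil
    \quad\text{e.g. } b = 4,\ N = 10^6:\ \log_{16} 10^6 \approx 4.98 \ \Rightarrow\ \text{about 5 hops}

So even a million-node overlay resolves a key in a handful of overlay hops, regardless of how long each individual hop is geographically.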
While selecting node IDs to add to each row of the routing table, Pastry prefers nodes that are topologically closer to it. The lower the row number, say i, the more candidates are available to choose the nearest nodes from, since only the first i digits of the prefix need to match. As the row number goes up, the available close-neighbor choices decrease, and hence the later hops of a route may have higher latency.

kademlia closest good nodes won't intersect enough between two requests

Working on a BEP44 implementation, I use the defined Kademlia algorithm to find the closest good nodes given a hash ID.
Using my program I run go run main.go -put "Hello World!" -kname mykey -salt foobar2 -b public and get the value stored on over a hundred nodes (good).
Now, when I run it multiple consecutive times, the sets of IPs written to by the put requests intersect poorly.
This is a problem because when I try to do a get request, the set of IPs queried does not intersect with the put set, so the value is not found.
In my tests I use the public DHT bootstrap nodes
"router.utorrent.com:6881",
"router.bittorrent.com:6881",
"dht.transmissionbt.com:6881",
When I query the nodes, I select the 8 closest nodes (nodes := s.ClosestGoodNodes(8, msg.InfoHash())), which usually ends up in a list of ~1K queries after a recursive traversal.
In my understanding, the choice of nodes that store addresses for an infohash is deterministic given the state of the table. As I am doing consecutive queries I expect the table to change, indeed, but not that much.
How can it happen that the sets of storage nodes do not intersect?
Since BEP44 is an extension it is only supported by a subset of the DHT nodes, which means the iterative lookup mechanism needs to take support into account when determining whether the set of closest nodes is stable and the lookup can be terminated.
If a node returns a token, v or seq field in a get response then it is eligible for the closest-set of a read-only get.
If a node returns a token then it is eligible for the closest-set for a get that will be followed by put operation.
So your lookup may home in on a set of nodes in the keyspace that is closest to the target ID but not eligible for the operations in question. As long as you have candidates that are closer than the best known eligible contacts you have to continue searching. I call this perimeter widening, as it conceptually broadens the search area around the target.
Additionally, you need to take error responses or the absence of a response into account when performing put requests. You can either retry the node or try the next eligible node instead.
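A minimal sketch of those eligibility rules, assuming a hypothetical Response type holding the decoded KRPC reply fields (the names are illustrative, not the asker's code):

    package dht

    // Response holds the fields of a decoded get reply that matter here.
    type Response struct {
        Token []byte // write token, required for a later put/announce
        V     []byte // stored value (BEP44)
        Seq   *int64 // sequence number of a mutable item (BEP44)
    }

    // eligibleForGet reports whether the responder may count towards the
    // closest-set of a read-only get.
    func eligibleForGet(r Response) bool {
        return len(r.Token) > 0 || len(r.V) > 0 || r.Seq != nil
    }

    // eligibleForPut reports whether the responder may count towards the
    // closest-set of a get that will be followed by a put.
    func eligibleForPut(r Response) bool {
        return len(r.Token) > 0
    }

The lookup then terminates only once no unvisited candidate is closer than the best eligible contacts found so far.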
I have written down some additional constraints that one might want to put on the closest set in lookups for robustness and security reasons in the documentation of my own DHT implementation.
which usually end up in a list of ~1K queries after a recursive traversal.
This suggests something is wrong with your lookup algorithm. In my experience a lookup should only take somewhere between 60 and 200 udp requests to find its target if you're doing a lookup with concurrent requests, maybe even fewer when it is sequential.
Verbose logging of the terminal sets, to eyeball how the lookups make progress and how much junk I am getting from peers, has served me well.
In my tests i use the public dht bootstrap node
You should write your routing table to disk and reload it from there, and only perform bootstrapping when none of the persisted nodes in your routing table are reachable. Otherwise you are wasting the bootstrap nodes' resources and also wasting time by having to re-populate your routing table before performing any lookup.
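A minimal sketch of that persistence strategy, assuming the routing table can be dumped to and restored from a list of "ip:port" strings and that reachable is whatever ping check you already have:

    package dht

    import (
        "encoding/json"
        "os"
    )

    var bootstrapNodes = []string{
        "router.utorrent.com:6881",
        "router.bittorrent.com:6881",
        "dht.transmissionbt.com:6881",
    }

    // saveTable persists the current routing table contacts as JSON.
    func saveTable(path string, addrs []string) error {
        data, err := json.Marshal(addrs)
        if err != nil {
            return err
        }
        return os.WriteFile(path, data, 0o644)
    }

    // loadOrBootstrap returns persisted contacts that still respond, and only
    // falls back to the public bootstrap nodes when none of them are reachable.
    func loadOrBootstrap(path string, reachable func(string) bool) []string {
        var addrs []string
        if data, err := os.ReadFile(path); err == nil {
            _ = json.Unmarshal(data, &addrs)
        }
        var alive []string
        for _, a := range addrs {
            if reachable(a) { // e.g. a ping that got a valid response
                alive = append(alive, a)
            }
        }
        if len(alive) > 0 {
            return alive // no need to touch the bootstrap nodes at all
        }
        return bootstrapNodes // bootstrapping only as a last resort
    }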

Massive query with predicate questions

I am working on a project to move my repository to Hazelcast.
I need to find some documents by date range, store type and store IDs.
During my tests I got 90k throughput using one c3.large instance, but when I execute the same test with more instances the throughput scales poorly (10 instances: 500k, 20 instances: 700k).
These numbers were the best I could get by tuning some properties:
hazelcast.query.predicate.parallel.evaluation
hazelcast.operation.generic.thread.count
hz:query
I have tried changing the instance type to c3.2xlarge to get more processing power, but the numbers don't justify the price.
How can I optimize Hazelcast to be faster in this scenario?
My use case doesn't use map.get(key), only map.values(predicate).
Settings:
Hazelcast 3.7.1
Map as Data Structure;
Complex object using IdentifiedDataSerializable;
Map index configured;
Only 2000 documents on map;
Hazelcast embedded configured by Spring Boot Application (singleton);
All instances in same region.
Test
Gatling
New Relic as service monitor.
Any help is welcome. Thanks.
If your use case only uses map.values with a predicate, I would strongly suggest using OBJECT as the in-memory format. This way there will not be any serialization involved during query execution.
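This is not Hazelcast API code, just an illustration (in Go) of the cost difference being described: with BINARY-style storage every predicate evaluation has to deserialize the value first, while with OBJECT-style storage the predicate runs on an already-live object.

    package main

    import (
        "encoding/json"
        "fmt"
    )

    type Doc struct {
        StoreType string `json:"storeType"`
        StoreID   int    `json:"storeId"`
    }

    func matches(d Doc) bool { return d.StoreType == "outlet" && d.StoreID == 42 }

    func main() {
        raw, _ := json.Marshal(Doc{StoreType: "outlet", StoreID: 42})

        serialized := make([][]byte, 2000) // "BINARY": values kept serialized
        objects := make([]Doc, 2000)       // "OBJECT": values kept as live objects
        for i := range serialized {
            serialized[i] = raw
            objects[i] = Doc{StoreType: "outlet", StoreID: 42}
        }

        hits := 0
        for _, b := range serialized {
            var d Doc
            _ = json.Unmarshal(b, &d) // per-entry, per-query deserialization cost
            if matches(d) {
                hits++
            }
        }
        for _, d := range objects {
            if matches(d) { // no deserialization at query time
                hits++
            }
        }
        fmt.Println("hits:", hits)
    }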
On the other hand, it is normal to get very high numbers when you only have 1 member, because no data moves across the network. To improve further, I would look at EC2 instances with higher network capacity. For example, c3.8xlarge has a 10 Gbit network, compared to the "High" network rating that comes with c3.2xlarge.
I can't promise how much of an increase you will get, but I would definitely try these changes first.

Circular Distributed Hash Table overlay P2P network

I think I'm missing something here or confusing terms perhaps.
What happens to the key:value pairs stored at a peer in the overlay DHT when that peer leaves the P2P network? Are they moved to the new appropriate nearest successor? Is there a standard mechanism for this if that is the case?
My understanding is that the successor and predecessor information of adjacent peers has to be modified as expected when a peer leaves; however, I can't seem to find information on what happens to the actual data stored at that peer. How is the data kept complete in the DHT as peer churn occurs?
Thank you.
This usually is not part of the abstract routing algorithm that's at the core of a DHT but implementation-specific behavior instead.
Usually you will want to store the data on multiple nodes neighboring the target key; that way you get some redundancy to handle node failures.
To keep it alive you can either have the originating node republish it in regular intervals or have the storage nodes replicate it among each other. The latter causes a bit less traffic if done properly, but is more complicated to implement.
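A minimal sketch of the originator-republish variant, with storeOnClosest as a hypothetical hook into whatever DHT client is in use (stubbed so the sketch compiles):

    package dht

    import "time"

    // Hypothetical hook: store value on the `replicas` nodes currently closest to key.
    var storeOnClosest = func(key [20]byte, value []byte, replicas int) {}

    // republish keeps a locally-originated record alive despite churn by
    // re-storing it on the current closest nodes at a regular interval.
    func republish(key [20]byte, value []byte, interval time.Duration, stop <-chan struct{}) {
        ticker := time.NewTicker(interval)
        defer ticker.Stop()
        for {
            storeOnClosest(key, value, 8) // e.g. 8-way redundancy near the key
            select {
            case <-ticker.C:
            case <-stop:
                return
            }
        }
    }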
