How Distributed Hash Table in IPFS and Bittorrent prevent abuse? - bittorrent

My understanding is that IPFS and Bittorrent Mainline DHT are built on top of a Distributed hash Table (Kademlia).
They use the file hash as Kademlia key to find a list of peer that might have this file.
1- What I don't understand is if this is all decentralized who remove from the DHT peer that no longer host a file content?
2- What prevent someone from storing large amount of data for free inside the DHT?
3- What prevent someone from disrupting the network by adding large number of invalid peer for a popular file.
4- What prevent a bad actor from joining the DHT ring and not following the routing protocol thus preventing discovery message from reaching correct nodes.

Not sure why this was downvoted. These are excellent questions.
1- What I don't understand is if this is all decentralized who remove
from the DHT peer that no longer host a file content?
I think that DHT entries are regularly re-broadcast. So if a peer goes away, its DHT entries will no longer be broadcast and the network will forget about the data it provides unless some other node has it.
2- What prevent someone from storing large amount of data for free
inside the DHT?
Unless you re-publish or somebody else is interested in the data, it will vanish. The amount of data that you can store directly in a DHT entry is limited. So you can make other nodes store some of your data by putting data directly into DHT entries, but the effort outweighs the benefits.
3- What prevent someone from disrupting the network by adding large
number of invalid peer for a popular file.
I think there are some mechanisms envisioned in IPFS to protect the DHT against attacks. However, I don't think the current implementation is all that sophisticated. I don't think that current IPFS would deal well with a large scale distributed DDOS attack.
4- What prevent a bad actor from joining the DHT ring and not
following the routing protocol thus preventing discovery message from
reaching correct nodes.
I think a single node would be insufficient to do much damage, because a node will ask multiple peers. You would have to have multiple nodes to do significant damage.
But IPFS as it is now would not survive a sophisticated attack by state actors.

Related

Kademlia DHT: Consequences of duplicate Node IDs (GUID - Sybil)

In Kademlia and other DHTs, each node should be uniquely identifiable, yet nothing inherently enforces the random creation of an ID.
Thus my question: What would be the consequence of a new (adversarial) peer joining the network with an existing Node ID? Would the new (adversarial) peer be rejected, since the Node ID with an associated IP is already present in many k-buckets?
From the original paper:
Each Kademlia node has a 160-bit node ID. Node IDs are constructed as
in Chord, but to simplify this paper we assume machines just choose a
random, 160-bit identifier when joining the system.
The kademlia paper's focus is describing a routing algorithm. It does not deal with real-world concerns such as NAT or malicious nodes.
So the strategy of dealing with this kind of attack will vary between implementations. Some may use cryptographic node IDs, some may revalidate the old IP or ignore packets from different IPs, others may be completely fooled by this or oscillate between both IPs claiming the same ID.

Securing a p2p network, so that intermediate nodes do not get to access the contents of the packets being transmitted

What mechanisms exist already for designing a P2P architecture, in which the different nodes do work separately, in order to split a task (say distributed rendering of a 3D image), but unlike torrents, they don't get to see, or hijack the contents of the packets being transmitted? Only the original task requester is entitled to view the? results of the complete task.
Any working implementations on that principle already?
EDIT: Maybe I formulated the question wrongly. The idea is that even when they are able to work on the contents of the separate packets being sent, the separate nodes never get the chance to assemble the whole picture. Only the one requesting the task is supposed to do this.
If you have direct P2P connections (no "promiscuous" or "multicasting" sort of mode), the receiving peers should only "see" the data sent to them, nothing else.
If you have relay servers on the way and you are worried that they can sniff the data, I believe encryption is the way to go.
What we do is that peer A transmits data to peer B in an S/MIME envelope: the content is signed with the Private Key of Peer A and encrypted with the public Key of Peer B.
Only peer B can decrypt the data and is guaranteed that peer A actually sent the data.
This whole process is costly CPU and byte wise and may not be appropriate for your application. It also requires some form of key management infrastructure: peers need to subscribe to a community which issues certificates for instance.
But the basic idea is there: asymetric encryption with a key or shared secret encryption.

DHT in P2P systems

In a P2P system, what is a difference between:
send a query message to a known node and the node re-send a response(I mean I explicitly contact a node by sending a message to ask him somethings).
if there is a DHT which contains information about nodes and their resources(each recording contain a key that represent IP # of each node, and a list of its available resources), so if I have an access to this DHT (may be I am a member) and I know the key or the identifier of a given node, first can I look directly at the recording of this node without need to send it a message or a query(I mean I implicitly contact it)?second, if yes how? I mean how the DHT is represented physically, and how a node updates its information?
In case 1. if you are sure the remote node has the resource, then DHT is useless.
In case 2, DHT helps you locate resources. Yes, you can take a look at the DHT record (if you have any) about the remote node. It will give you an indication of whether the resource might be available on that remote node.
Typically, DHT are in memory tables, or table stored on a local small database. There are many ways to push the info to remote nodes, a common way is push the info to random nodes.

Can bittorrent peers handle seeding large numbers of idle torrents

I'm considering using bittorrent for a large data dissemination problem where the data source is petascale and users will want up to several terabytes. Some details
Number of torrents potentially in the millions
torrent sizes ranging from 100Mb to 100Gb
A stable set of clusters around the world capable of acting as seeders each holding a large subset of the total torrents (say 60% on average)
A relatively small number of simultaneous users (less than 100) wanting to download on average a few terabytes of data.
I expect the number of active torrents to be small compared to the total available but quality of service is important so there must be several seeders for each torrent or some mechanism for launching new seeders.
My question is can bittorrent clients handle seeding huge numbers of torrents, most of which are idle? Would I need to stripe torrents across the seeders in a cluster or could each node be seeding all torrents it has access to? Which client would do the best job? Are there any tools for managing clusters of seeders?
I am assuming that trackers can be made to scale to this level.
There are 2 main problems:
Each torrent (typically) needs to announce to a tracker periodically, this might end up using a significant amount of bandwidth.
The bittorrent client itself need to be written in a way to scale with a large number of torrents
As for the tracker traffic, let's assume you have 1 million torrents, the typical re-announce interval is 30 minutes, but some tracker has it set to 1 hour. Let's be conservative and assume your tracker uses 1 hour announce intervals. You will have to make 1 million GET requests per hour, let's say each request is 400 bytes up and 100 bytes down (assuming most responses will not contain any peers), that's about 111 kB/s up and 28 kB/s down constantly. That's not so bad, but keep in mind that TCP requires an extra round-trip for establishing connections, so that's another 40 bytes down and 40 bytes up.
This can be mitigated by only using UDP trackers. Then you would only need a single connect-message, and you can reuse the connection ID for each announce. Each announce message would then be 100 bytes, and the returned message would be a bit more compact as well, let's assume 60 bytes. That would get you 28 kB/s up and 16kB/s down, just to keep the torrents announced. For this you would need a client with decent udp tracker support (one that caches the connection ID for instance).
Not too bad, assuming that's insignificant compared to the actual data your seeds would send.
However, you don't necessarily need to stripe your torrents across separate data centers, you could also use an HTTP server to seed the torrents. All major bittorrent clients support http seeding, and you wouldn't have to worry about announcing to the tracker (the URL is burned into the .torrent itself).
As for a client that scales well with torrents, I don't know for sure, I haven't done any measurements. It should be fairly straightforward to just generate a million random torrents and try to load it up.
I have done some optimization work in libtorrent rasterbar to make it scale well with many torrents, I haven't tried millions though.
I've written a blog post on this topic, here.
You may be looking for Hekate
It's in, at best, pre-alpha right now, but it's quite nearly what you're describing.
To not collapse under the overhead of useless tracker announces and scrapes in the millions (and that in every announce interval), you have to restrict your seeding clusters to only load the current working set of items that are requested right now. Downloaders need to get (download) the .torrent file from a central place anyway, and that could trigger loading it into the seeding clusters. Alternatively, determine activity for a particcular info-hash by recognizing announces that do NOT originate from one of your seed clusters.
rTorrent has fast-resume (meaning no hashing happens when an appropriately prepared .torrent is loaded), and is controllable via xmlrpc so you can decommission idle items. That way, a .torrent download can trigger the actual data to be available for the next 24 hours, or as long as there's activity in the swarm.
The protocol allows for this, but I do not know which clients would scale to millions of torrents. In the worst case, you would have to write your own seed-only client.
The protocol feature most relevant to your use case is that, when a peer connects to another, the connecting peer is supposed to send the torrent's info-hash first. This means that a single listening TCP port could be used to seed an unlimited amount of torrents, with almost zero resources used when idle.
This can be found on The BitTorrent Protocol Specification:
If both sides don't send the same value, they sever the connection. The one possible exception is if a downloader wants to do multiple downloads over a single port, they may wait for incoming connections to give a download hash first, and respond with the same one if it's in their list.
I also found the same on this Bittorrent Protocol Specification v1.0:
The initiator of a connection is expected to transmit their handshake immediately. The recipient may wait for the initiator's handshake, if it is capable of serving multiple torrents simultaneously (torrents are uniquely identified by their info_hash).
However, there is one thing that would increase your load, and it is the tracker. With the normal tracker protocol, each client has to periodically announce to the tracker each torrent it has, together with information like how much it has uploaded. With millions of torrents, this would present a somewhat high load. If you were writing your own mass-seed-only client, a separate protocol to announce your seeders to the tracker would be a good idea.

Peer to Peer: Methods of Finding Peers

Are there any known methods of finding peers without using a dedicated central server?
ie: If I have peers which are disconnecting and reconnecting to the internet but getting a new IP address each time, and I want to connect to them without setting up a dedicated server to register with.
I was thinking about using peers email address to send a manifest of connected peers periodically, with some sort of timecode, negating the need for a dedicated server. This would be a fallback if none of the peers could be connected to after trying all the previously known peer addresses. But existing models of finding peers would be preferable.
There's no way around having to know at least one initial peer to discover more.
Fully P2P protocols, such as Gnutella or Gnutella2, or the simpler Overnet (made famous by Storm Worm), are based on each client having a start-up list of a few peers. These can come off a web-based automated tracker for example. The client will discover the whole network or portions of it by asking other peers for more addresses, for example when delegating a file search.
If you truly can't have any kind of a centralized resource, the best you can do is find the first peer through broadcasted messages and ultimately IP address scanning. The first approach is well-meaning but in at least 98% of cases won't yield any results. The later approach, of course, is abusing the internet, as well as illegal in most countries.
I really would rethink having some kind of a central tracker. It can be something as simple as a PHP script on a webserver (the gnutella network, today, is held up by ten-twenty such scripts, hosted by people who don't even know each other). And this sure is more lightweight than email (which, due to spam filters at the very least, would not work anyway).
In the limited case of peers within an intranet, it is possible to send a broadcast UDP message to a known port asking for peers to report back.
The BitcoinQT client uses a variety of methods to find nodes, some of them might be useful to you.
Satoshi Client Node Discovery
IRC is no longer used, but might be the most easy to implement:
As of version 0.6.x the Bitcoin client no longer uses IRC bootstrapping by default, and as of version 0.8.2 support for IRC bootstrapping has been removed completely. This documentation below is accurate for most prior versions.
In addition to learning and sharing its own address, the node learned about other node addresses via an IRC channel. See irc.cpp.
After learning its own address, a node encoded its own address into a string to be used as a nickname. Then, it randomly joined an IRC channel named between #bitcoin00 and #bitcoin99. Then it issued a WHO command. The thread read the lines as they appeared in the channel and decoded the IP addresses of other nodes in the channel. It did this in a loop, forever, until the node was shutdown.
When the client discovered an address from IRC, it set the timestamp on the address to the current time, but it used a "penalty" of 51 minutes, which means it looked like it was actually seen almost an hour earlier.
Take advantage of any existing forum where data can posted. Think secret IRC channel, embedding data in photos and posting to photo sharing sites 4chan?, any site that would allow your application to login and post data without captia logins etc.
http://chatzilla.hacksrus.com/faq/#password
Another strategy might be to embedded messages in digital currency transactions. Pick a cheap coin that's likely to hang around ... DOGE or MOON coin maybe. Build wallet functionality into your app. such that you can post micro transactions back and forth between addresses that your app controls. There would still be a miners fee, but this is only fractions of pennies. Even if they later prohibit adding metadata to transactions, you could make a transaction equivalent to your IP address in MOON, and use vanity addresses in MOON coin for your app. such that when a new node comes online it knows what to search the blockchain for -- 2daMOON%bootStr#pM3. SEND - 104.003021133 MOON IP = 104.3.21.133 not an expensive proposition.
Old question but I've been thinking about this problem myself so will ad my 2-cents. In short, a central server is not required if a node is aware of at least one valid peer. New nodes must be added to the network by any current member (e.g. invited, or node spawns another node, depending on your application).
Assuming that:
agents keep track of peers; the size of this address book and how entries are managed will depend on the nature of the system; e.g. how long peers remain connected, if peers use stable addresses
agents share peer information with other peers
at least some agents remain available for relatively long periods of time relative to frequency node connects to network to update it's address book (or nodes have stable addresses)
in addition to peer addresses, availability information is also tracked (many options here depending on your system. examples include: whether peer has a stable address, when last seen, some availability metric, content/service type information, address valid-until time if known)
new agents are initialized with at least one valid peer (doesn't have to be a central node, can be any valid node)
trust mechanisms shall be required if malicious peers are a possibility
When a peer comes online, it queries the peers in it's peer table to discover which are active and perhaps removes expired dynamic addresses. Nodes exchange peer information and may become linked themselves. This peer discovery/exchange may continue a certain number of hops or via random walk until peer list if of sufficient size and/or quality.
A few more details:
Nodes connect and share peer information with frequency related to how often node addresses change, so address book doesn't become stale and node becomes disconnected because none of it's former peers are available at their last known addresses
Nodes may need to limit the number of peers they accept, to avoid tendency towards centralization around the most stable nodes.
Nodes should be selective about the peers they keep; i.e. ones in which they are more likely to exchange data (e.g. weight based upon history)
Node links may be asymmetric or symmetric depending on the application
Three ways, off the top of my head, though you're always going to need some central server to start the connection unless you went with option 3.
Central server that maintains known list of peers, with keep-alive.
One or more central servers that maintain some common resource peers can use to discover one another, but once connected no longer need the central server as long as the peer remains connected (something like BitTorrent); can chain peered connections as well.
Port/IP scanning (strongly not recommended).
In your example, you'd still have some kind of central server where the peers would be registered; the protocol is the only difference.
To put it simply no, there is no way to do this without a central sever.
If you want to do this you simply need one or more central servers, whether by dynamic dns or not. The clients need a method to discover where they should connect to, and the only truly sensible way to do this is with your own server, in the simplest scenario it only needs to send an IP address in response.
Virtual severs can be had for around $15/month, which IMO is considerably cheaper than trying to use or abuse someone else's bandwidth.
[Edit].
To put it simply, there is another way, as follows.
Upon reflection I think what I'd do is to designate a set of peers as cluster controllers and use a dynamic DNS service to allow other peers to discover the cluster controllers.
Choose a dynamic DNS provider I'll call it myc.ath.cx (I Use http://www.dyndns.com/).
Each peer has to be capable of becoming a cluster controller. A cluster controller will contain a list of all the other peers connected.
When a peer is started it looks up myc.ath.cx and attempts to connect. If connection cannot be made within a period, say 30 seconds, it takes over the registration of the DNS entry.
Any peer wishing to discover other peers can simply query myc.ath.cx and a list will be provided
All peers are responsible for periodically downloading the list of peers, in case they need to cluster controller.
The cluster controller will periodically query the DNS entry - if has changed from it's IP address then it knows that it is no longer the cluster controller - so it will contact the cluster controller that currently has the DNS entry and provide it's list of known hosts.
The cluster controller will periodically contact hosts on the list to ensure that they are still valid.
Your method of sending email does use a dedicated server, though; the peer's email server, to be precise.
Roughly, I don't think it's possible without using some sort of dedicated storage or server (which the email approach does, albeit obliquely) UNLESS you are able to characterize the connectivity to the internet that your peers are using.
Basically, if you have a set of X number of peers, that connect for Y amount of time, and they are then off the grid for Z amount of time... essentially, you can construct a probability equation about how likely it is that the set of peers that you last contacted is still available; where that probability approaches 1 (for a given set of X, Y, and Z above), you can most likely sustain a peer-to-peer network without using storage.
Possibly more in the spirit; instead of having a "dedicated central server", use simple online free service to specify a peer list. Set up a yahoo group, or something like that; clients can automatically look it up and get a peer address from which to query a set of peers; the client can be coded with the authentication to post to the group, and can post periodically its IP address so that others can request the set of known active peers.
If you want to get really tricky, you can start using basically steganographic methods to hide peer location information. I.e. get a google search for "blah"; find the first site listed in the results that has an unprotected (no CAPTCHA) message board; find the third (or whatever) post that starts with "Indubitably" (or whatever), and find the header of the first message there, and there's the IP address of a peer. If that doesn't work, go down the list of search terms to the next one.
But that's sneaky. :-)
Could you re-use an existing dedicated server for the purpose?
I am thinking in particular of registering each of the peers with a Dynamic DNS, but if you were willing to get a bit uglier, sharing access to a known Hotmail account or Google Doc or the like.
You can either use a central directory or some sort of broadcast protocol for service discovery. Assuming that you could get them indexed by Google, you could conceive of a system whereby each peer runs a web site with some unique, rare words contained on a specific page. You could then use Google search results based on these words to identify potential peers. This would essentially be a (noisy and slow) internet broadcast.
If the page structure was a well known pattern or contained identifiable connection information for that peer, it would be easy to distinguish them in the search results. Using such a public directory leaves you open to compromised nodes in the network that is formed, but this is pretty much true of any P2P network absent some security mechanism.
Getting the web sites crawled and highly ranked by Google (or some other search engine) for your particular arcane set of search terms would be the trick. I can think of a couple of ways, but they aren't ones that I would use. For a legitimate service, I'd rather spend the money or find a free web site that could function as a directory.
What about another P2P system built specifically to track online peers of other P2P systems?
Then we reduce the problem of finding peers for any new P2P system to simply finding peers for the 'main' P2P system, which will give you the addresses of online peers for the system you're interested in using...
This is a typical use of a distributed hash table algorithm. I'd suggest looking at something like pastry. It uses a overlay network (Application layer network) on top of other layers.
Each node has a GUID which is used to route requests across the peer network.
If you're loooking for an already established central server then see the metaserver entry on page here:
http://martindevans.appspot.com/
You can register peers on there and then other peers can find them. Obviously this is a central server, but it requires no maintenance on your part.

Resources