How does the DHT protocol work? Are my thoughts correct? - p2p

I'm trying to understand how the DHT protocol works, especially in the file-sharing/torrent world. I've read many articles, but I'm still confused about how the filename-to-key hashes are generated.
My thoughts on how the DHT works are the following:
Let's say I'm joining a p2p network and I want to share some files. For these files, hash keys are generated and "travel" through the network until the nodes responsible for those keys are reached. Each of these nodes then adds a record to its list saying "the guy with IP address X has the file related to this key".
When I search for a file, the hash key for that file is generated and travels the network until the node responsible for the key is found. That node then communicates with me and sends me the IP addresses of the nodes that host the actual data.
Are my thoughts above correct?

Your thoughts are correct. This is the general idea behind DHTs.
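As a toy illustration of that flow, here is a minimal sketch in Python. Everything is simplified to absurdity: 8-bit IDs, a global view of all nodes, and "responsible" meaning smallest XOR distance (real DHTs such as Kademlia route toward the key iteratively instead of having a global view):
import hashlib

# node_id -> {key: [peer addresses]}; toy 8-bit IDs instead of 160-bit ones
nodes = {0x1A: {}, 0x7F: {}, 0xC3: {}}

def key_for(name: bytes) -> int:
    return hashlib.sha1(name).digest()[0]  # truncate the 160-bit hash to 8 bits to match the toy IDs

def responsible(key: int) -> int:
    return min(nodes, key=lambda n: n ^ key)  # "closest" node by XOR distance, as in Kademlia

def announce(name: bytes, peer: str) -> None:
    k = key_for(name)
    nodes[responsible(k)].setdefault(k, []).append(peer)  # "the guy at this address has this key"

def lookup(name: bytes) -> list:
    k = key_for(name)
    return nodes[responsible(k)].get(k, [])

announce(b"ubuntu.iso", "203.0.113.5:6881")   # sharing a file
print(lookup(b"ubuntu.iso"))                  # -> ['203.0.113.5:6881']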

Related

Parsing TCP payloads using a custom spec

My goal is to create a parser for TCP packets that use a custom spec from the Options Price Reporting Authority (found here), but I have no idea where to start. I've never worked with low-level stuff, and I'd appreciate some guidance.
The huge problem is that I don't have access to the actual network, because it costs a huge sum per month, and all I can work from is the specification. I don't even know if that's possible. Do you parse each byte step by step and hope for the best? Do you first re-create some example data using the bytes in the spec and then parse it? Isn't that also difficult, since (I think) TCP spreads the data across multiple segments?
That's quite an elaborate data feed. A quick review of the spec shows that it contains enough information to write a program in either nodejs or golang to ingest it.
Getting it to work will be a big job. Your question didn't mention your level of programming skill, or of your network engineering skill. So it's hard to guess how much learning lies ahead of you to get this done.
A few things.
It's a complex enough protocol that you will need to test it with correctly formatted sample data. You need a fairly large collection of sample packets in order to mock your data feed (that is, build a fake data feed for testing purposes). While nothing is impossible, it will be very difficult to build a bug-free program to handle this data without extensive tests.
If you have a developer relationship to the publisher of the data feed, you should ask if they offer sample data for testing.
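As a sketch of that mocking idea: suppose, purely hypothetically (this is not the real OPRA layout), that a message were an 8-byte sequence number, a 1-byte message type, and a 4-byte price. You could then generate test packets straight from the spec and round-trip them through your parser:
import struct

# Hypothetical layout, NOT the real OPRA format:
# seq (uint64, big-endian), type (1 byte), price in cents (uint32)
def build_packet(seq, msg_type, price_cents):
    return struct.pack(">QcI", seq, msg_type, price_cents)

def parse_packet(data):
    seq, msg_type, price_cents = struct.unpack(">QcI", data)
    return seq, msg_type, price_cents

# a tiny self-test: every field survives the round trip
assert parse_packet(build_packet(1, b"Q", 12345)) == (1, b"Q", 12345)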
It is not a TCP/IP data feed. It is an IP multicast datagram feed. In IP multicast feeds you set up a server to listen for the incoming data packets. They use multicast to achieve the very low latencies necessary for predatory algorithmic trading.
You won't use TCP sockets to receive it; you'll use a different programming interface called UDP datagram sockets.
If you're used to TCP's automatic recovery from errors, datagrams will be a challenge. With datagrams you cannot tell if you failed to receive data except by looking at sequence numbers. Most data feeds using IP and multicast have some provision for retransmitting data. Your spec is no exception. You must handle retransmitted data correctly or it will look like you have lots of duplicate data.
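To make the datagram side concrete, here is a minimal Python sketch of joining a multicast group and watching for sequence gaps. The group address, port, and the assumption that the first 8 bytes are a big-endian sequence number are all placeholders, not values from the OPRA spec:
import socket
import struct

GROUP, PORT = "233.54.12.1", 6001  # hypothetical multicast group and port

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.bind(("", PORT))
# ask the kernel to join the multicast group on the default interface
mreq = struct.pack("4s4s", socket.inet_aton(GROUP), socket.inet_aton("0.0.0.0"))
sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)

expected = None
while True:
    data, _addr = sock.recvfrom(65535)
    # placeholder header: assume the first 8 bytes are a big-endian sequence number
    (seq,) = struct.unpack_from(">Q", data, 0)
    if expected is not None and seq != expected:
        print(f"gap: expected {expected}, got {seq} (request retransmission here)")
    expected = seq + 1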
Multicast data doesn't move over the public network. You'll need a virtual private network connection to the publisher, or to co-locate your servers in a data center where the feed is available on an internal network.
There's another, operational, spec you'll need to cope with to get this data. It's called the Common IP Multicast Distribution Network Recipient Interface Specification. This spec has a primer on the multicast dealio.
You can do this. When you have made it work, you will have gained some serious skills in network programming and network engineering.
But if you just want this data, you might try to find a reseller of the data that repackages it in an easier-to-consume format. That reseller probably also imposes a delay on the data feed.

How to handle sticky-sessions with Socket.io 1.0 when behind a firewall?

I am trying to set up a POC for myself using Nginx, Node.js and Socket.io 1.0 with clustering on Rackspace. I am under the assumption that I need to use clustering because I want this to be scalable across multiple servers if needed. I want each node to have its own instance, and as of now I can't see any need for the instances to talk to each other. Again, as of now, I believe I need clustering simply because I may have many clients connecting to this server and I want it to be able to grow and shrink accordingly. My end goal is to build a little POC similar to what is shown here: https://cloud.google.com/developers/articles/real-time-gaming-with-node-js-websocket-on-gcp
I just got what I believe to be a valid setup of the new Socket.io 1.0 established, but when connecting from different devices behind my router, they all show the same PID in my logging, and I assume this is due to the sticky-sessioning required by Socket.io. I am not sure if this is the same as the worker process that we used to get with clustering, but again, I am still trying to get my head wrapped around all this.
First, I want to know if using clustering and sticky-sessions is required. Since only one PID is issued for the same external IP, is there any way to have each computer treated as its own instance? I do not want to send back a response that updates everyone behind that IP.
My second question is this, and it may be a stupid question but I'm asking anyway :) In reading about how to get sticky-sessions working, I kept seeing people state to "use sticky-sessions, like by IP address". The word "like" is what got me. I found people referring to using sticky-sessions with IPs and with cookies. Can you do it by anything else, such as a username, an issued token or anything? My concern is that if someone is playing with this on a mobile device and they switch towers, the tower will issue a new IP, so in turn a new PID would get issued and essentially that player's game would be lost. Am I understanding this right?
Please forgive me as I am new to Node.js, but I thought this would be a cool way to learn Node.js and clustering in the cloud. Any info or direction that anyone can provide would be of great help. Many of the tutorials seem to broadcast events to everyone, but I am looking for a scalable solution where each connection can be sent events individually most of the time. I also need to solve for a number of people behind the same firewall being treated as separate connections when the server communicates with them. Again, if there is any reading or tutorials that you feel may help me with Socket.io 1.0 and what I am trying to do, please reply. Thanks!
In general, since you are using websockets, you don't need to worry about stickiness as long as the connection does not terminate. The communication is bi-directional and the HTTP connection is kept alive. If the connection drops, the client essentially reconnects and starts over. So yes, if anyone's IP gets renewed, you will get a new server socket.
Refer to the Socket.io article using-multiple-nodes, which states the sticky-session requirement for XHR/JSONP long-polling clients.
I don't believe nginx is capable of load balancing on things like MAC address, per the nginx load-balancing techniques documentation.
I am thinking that you may need a more capable load balancer that can route on MAC address, virtual port ID, or certain headers; for plain IP-based stickiness, though, see the sketch below.
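For the IP-based flavour, nginx itself can pin clients by source address with its ip_hash directive; a minimal config sketch (the upstream addresses and port are placeholders), keeping in mind it has exactly the per-external-IP limitation the question describes:
upstream socketio_nodes {
    ip_hash;                 # route each client IP to the same upstream server
    server 10.0.0.1:3000;    # placeholder node.js instances
    server 10.0.0.2:3000;
}

server {
    listen 80;
    location / {
        proxy_pass http://socketio_nodes;
        proxy_http_version 1.1;                    # needed for the websocket upgrade
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}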

Securely connecting two node.js servers

I am hosting an app on Nodejitsu, and I want that app to turn pages into PDFs with wkhtmltopdf, which is a binary. Alas, I'm not on a big enough plan with Nodejitsu to get SSH access to install wkhtmltopdf.
What I want to do is host the PDF converter on AWS EC2, but I am unsure of a good way to connect the two node.js servers securely.
Also, once the two servers are connected, what would be the most efficient way to transfer files between them?
Are there concrete design patterns for this?
It kind of depends on what you need your security for, and how much of it you need. I assume you have some data on your Nodejitsu server, want to send it (as HTML) to Amazon, and receive it back (as PDF) from Amazon on Nodejitsu, and you want:
nobody to be able to read your document in between
nobody to tamper with your document in between
First, realise that anyone with the right access to Nodejitsu or Amazon (either an employee or a hacker) will be able to see your data anyway (and probably anyone with access to your database provider as well). The chances of a data leak there are, imho, massively larger than of someone listening in on or tampering with the connection in between, especially if you host both in the US or in (another) country whose government you trust (note that if the US government wants access to your data and either the Nodejitsu or the Amazon servers are physically in the US, they will get it anyway). But the thing is, barring some major internet disruption, the connection between Nodejitsu and Amazon will not run through shady providers or open wifi networks.
Once you've decided you still want/need the extra layer of security (and complexity), I would say: choose the easy route. No https, no certificates, no asymmetric encryption. Just choose a shared secret (a random password; the longer the better), AES-encrypt the data before sending it, and AES-decrypt it at the other end (using the built-in node crypto module). Just send your (encrypted) documents over http, and you're done. You don't even need an authorisation layer: anything received that doesn't decrypt to a valid document can be discarded.
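The answer suggests Node's crypto module; as a language-neutral sketch of the same shared-secret idea, here it is in Python using the third-party cryptography package (Fernet is AES-128-CBC plus an HMAC, so tampered payloads are rejected for you):
# pip install cryptography
from cryptography.fernet import Fernet

# generate once, then configure the same secret on both servers (e.g. via environment variables)
key = Fernet.generate_key()
f = Fernet(key)

token = f.encrypt(b"<html>page to convert</html>")  # sending side
html = f.decrypt(token)  # receiving side; raises InvalidToken if the data was tampered with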
It is true that https would prevent some other attacks; chiefly, the shared secret will probably live somewhere in your code, so also on your local machine (and GitHub), whereas with asymmetric encryption you only need to store the public key in your code. But still, the simple route will save you a lot of headaches!

How to get the first peer from a torrent-magnet link?

I've been trying to understand the torrent-magnet technology, but I can't seem to figure out how you get connected to the first peer when opening a magnet link.
When you get a magnet link like below, it contains no initial peer - only the BitTorrent Info Hash (btih) and the file name.
magnet:?xt=urn:btih:bbb6db69965af769f664b6636e7914f8735141b3&dn=ubuntu-12.04-desktop-i386.iso
According to BitTorrent & Magnets: How Do They Work? (MakeUseOf)
If you click a magnet link that does not specify a tracker (tr) the first peer will be found using DHT. Once you’ve got a peer, peer exchange kicks in too.
The DHT article on Wikipedia does not specify how to find a peer, but in the Kademlia article (upon which BitTorrent DHT is based), it says
A node that would like to join the net must first go through a bootstrap process. In this phase, the joining node needs to know the IP address and port of another node—a bootstrap node (obtained from the user, or from a stored list)—that is already participating in the Kademlia network.
But where does it know that node from? I don't see an address or anything present in the magnet link. Since it's decentralized (trackerless), I wouldn't expect it to know the node in advance. Or is the DHT in fact not decentralized?
For the most part, when you start a bittorrent client, you bootstrap off of:
nodes from your last session, that were saved to disk
other peers that you have on any of the swarms you're on
There are a few well-known bootstrap nodes which clients can use if they have no other means of finding any. Essentially the only case this happens is when you install a client for the first time, and the first torrent you download is a magnet link without a tracker.
You can then hit router.utorrent.com:6881. I believe transmission, azureus and bitcomet run similar routers, and possibly other clients as well.
By "router", I mean a node that appear to behave like any other node in the DHT, but probably has a different mechanism for determining which nodes to hand out, and probably is optimized specifically for the use case of just introducing dht nodes to each other.
UPDATE: you can run your own DHT bootstrap machine, here's the source code.
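For illustration, here is what hitting such a router looks like at the wire level, sketched in Python. The message is the bencoded KRPC "ping" query from BEP-5; error handling is omitted, and the router address is the one mentioned above:
import os
import socket

# bencoded KRPC ping: {"t": "aa", "y": "q", "q": "ping", "a": {"id": <20 random bytes>}}
node_id = os.urandom(20)
msg = b"d1:ad2:id20:" + node_id + b"e1:q4:ping1:t2:aa1:y1:qe"

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.settimeout(5)
sock.sendto(msg, ("router.utorrent.com", 6881))
reply, addr = sock.recvfrom(1500)
print(reply)  # bencoded response containing the router's own node id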

How do BitTorrent magnet links work?

For the first time I used a magnet link. Curious about how it works, I looked up the specs and didn't find any answers. The wiki says xt means "exact topic" and is followed by the format (btih in this case) with a SHA-1 hash. I saw base32 mentioned; knowing that's 5 bits per character and 32 characters, I found it holds exactly 160 bits, which is exactly the size of a SHA-1 hash.
There's no room for an IP address or anything, it's just a SHA1. So how does the BitTorrent client find the actual file? I turned on URL Snooper to see if it visits a page (using TCP) or does a lookup or the like, but nothing happened. I have no idea how the client finds peers. How does this work?
Also, what is the hash of? Is it a hash of an array of all the file hashes together? Maybe it's a hash of the actual torrent file required (stripping certain information)?
In a VM, I tried a magnet link with uTorrent (which was freshly installed) and it managed to find peers. Where did the first peer come from? It was fresh and there were no other torrents.
A BitTorrent magnet link identifies a torrent using[1] a SHA-1 or truncated SHA-256 hash value known as the "infohash". This is the same value that peers (clients) use to identify torrents when communicating with trackers or other peers. A traditional .torrent file contains a data structure with two top-level keys: announce, identifying the tracker(s) to use for the download, and info, containing the filenames and hashes for the torrent. The "infohash" is the hash of the encoded info data.
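To make "hash of the encoded info data" concrete, here is a minimal Python sketch. It bencode-decodes a v1 .torrent while remembering the raw byte span of each dictionary value, then hashes the info slice (the filename is a placeholder; no error handling):
import hashlib

def bdecode(data, i=0):
    """Minimal bencode decoder; returns (value, end_index)."""
    c = data[i:i+1]
    if c == b"i":                        # integer: i<digits>e
        end = data.index(b"e", i)
        return int(data[i+1:end]), end + 1
    if c == b"l":                        # list: l<items>e
        i, items = i + 1, []
        while data[i:i+1] != b"e":
            v, i = bdecode(data, i)
            items.append(v)
        return items, i + 1
    if c == b"d":                        # dict: d<key><value>...e
        i, d = i + 1, {}
        while data[i:i+1] != b"e":
            k, i = bdecode(data, i)
            start = i
            v, i = bdecode(data, i)
            d[k] = (start, i)            # remember each value's raw byte span
        return d, i + 1
    colon = data.index(b":", i)          # string: <length>:<bytes>
    n = int(data[i:colon])
    return data[colon+1:colon+1+n], colon + 1 + n

with open("example.torrent", "rb") as fh:   # placeholder filename
    raw = fh.read()
top, _ = bdecode(raw)
start, end = top[b"info"]                   # span of the raw bencoded info dictionary
print(hashlib.sha1(raw[start:end]).hexdigest())  # the v1 infohash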
Some magnet links include trackers or web seeds, but they often don't. Your client may know nothing about the torrent except for its infohash. The first thing it needs to do is find other peers who are downloading the torrent. It does this using a separate peer-to-peer network[2] operating a "distributed hash table" (DHT). A DHT is a big distributed index which maps torrents (identified by infohashes) to lists of peers (identified by IP address and port) who are participating in a swarm for that torrent (uploading/downloading data or metadata).
The first time a client joins the DHT network it generates a random 160-bit ID from the same space as infohashes. It then bootstraps its connection to the DHT network using either hard-coded addresses of clients controlled by the client developer, or DHT-supporting clients previously encountered in a torrent swarm. When it wants to participate in a swarm for a given torrent, it searches the DHT network for several other clients whose IDs are as close[3] as possible to the infohash. It notifies these clients that it would like to participate in the swarm, and asks them for the connection information of any peers they already know of who are participating in the swarm.
When peers are uploading/downloading a particular torrent, they try to tell each other about all of the other peers they know of that are participating in the same torrent swarm. This lets peers know of each other quickly, without subjecting a tracker or DHT to constant requests. Once you've learned of a few peers from the DHT, your client will be able to ask those peers for the connection information of yet more peers in the torrent swarm, until you have all of the peers you need.
Finally, we can ask these peers for the torrent's info metadata, containing the filenames and hash list. Once we've downloaded this information and verified that it's correct using the known infohash, we're in practically the same position as a client that started with a regular .torrent file and got a list of peers from the included tracker.
The download may begin.
[1] The infohash is typically hex-encoded, but some old clients used base32 instead. v1 (urn:btih:) uses the SHA-1 digest directly, while v2 (urn:btmh:) adds a multihash prefix to identify the hash algorithm and digest length.
[2] There are two primary DHT networks: the simpler "mainline" DHT, and a more complicated protocol used by Azureus.
[3] The distance is measured by XOR.
Further Reading
BEP-3: The BitTorrent Protocol Specification
BEP-52: The BitTorrent Protocol Specification v2
BEP-5: DHT Protocol
BEP-9: Extension for Peers to Send Metadata Files
BEP-10: Extension Protocol
BEP-11: Peer Exchange (PEX)
Azureus DHT Description
Peer discovery and resource discovery (files in your case) are two different things.
I am more familiar with JXTA but all peer to peer networks work on the same basic principles.
The first thing that needs to happen is peer discovery.
Peer Discovery
Most p2p networks are "seeded" networks: when first starting, a peer will connect to a well-known (hard-coded) address to retrieve a list of running peers. It can be direct seeding, like connecting to dht.transmissionbt.com as mentioned in another post, or indirect seeding, as usually done with JXTA, where the peer connects to an address that only delivers a plain-text list of other peers' network addresses.
Once connection is established with the first (few) peer(s), the connecting peer performs a discovery of other peers (by sending requests out) and maintains a table of them. Since the number of other peers can be huge, the connecting peer only maintains part of a Distributed Hash Table (DHT) of the peers. The algorithm to determine which part of the table the connecting peer should maintain varies depending on Network. BitTorrent uses Kademlia with 160 bit identifiers/keys.
Resource Discovery
Once a few peers have been discovered by the connecting peer, the latter sends out a few requests to them for discovery of resources. Magnet links identify those resources and are built in such a way that they act as a "signature" for a resource, guaranteeing that they uniquely identify the requested content among all the peers.
The connecting peer will then send a discovery request for the magnet link/resource to the peers around it. The DHT is built in such a way that it helps determine which peers should be asked first for the resource (read about Kademlia on Wikipedia for more).
If the requested peer does not hold the requested resource, it will usually "pass on" the query to additional peers fetched from its own DHT.
The number of "hops" the query can be passed on is usually limited; 4 is a usual number with JXTA-type networks.
When a peer holds the resource, it replies with its full details. The connecting peer can then connect to the peer holding the resource (directly or via a relay - I won't go into details here) and start fetching it.
Resources/Services in P2P networks are not directly attached to network addresses: they are distributed and that is the beauty of these highly scalable networks.
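A rough sketch of that pass-on behaviour in Python (the Peer structure and the three-neighbour fan-out are invented for illustration; real networks tend to use iterative rather than recursive lookups):
from dataclasses import dataclass, field

@dataclass
class Peer:                      # toy stand-in for a remote node
    node_id: int
    address: str
    resources: set = field(default_factory=set)
    routing_table: list = field(default_factory=list)

MAX_HOPS = 4                     # the JXTA-style hop limit mentioned above

def query(peer: Peer, key: int, hops: int = 0):
    if key in peer.resources:
        return peer.address                      # this peer holds the resource
    if hops >= MAX_HOPS:
        return None                              # hop limit reached; stop forwarding
    # forward to the neighbours whose IDs are XOR-closest to the key
    for neighbour in sorted(peer.routing_table, key=lambda p: p.node_id ^ key)[:3]:
        found = query(neighbour, key, hops + 1)
        if found:
            return found
    return None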
I was curious about the same question myself. Reading the code for transmission, I found the following in libtransmission/tr-dht.c:
bootstrap_from_name( "dht.transmissionbt.com", 6881,
                     bootstrap_af(session) );
It tries that 6 times, waiting 40(!) seconds between tries. I guess you can test it by deleting the config files (~/.config/transmission on unix), and blocking all communication to dht.transmissionbt.com, and see what happens (wait 240 seconds at least).
So it appears the client has a bootstrap node built in to start with. Of course, once it has gotten into the network, it doesn't need that bootstrap node anymore.
I finally found the specification. For the first time, Google didn't help. (The wiki linked to bittorrent.com, which is the main site. I clicked the developers link, noticed the bittorrent.org tab on the right, and it was easy from there. It's hard finding links when you have no idea what they are labeled and they're many clicks away.)
It seems like all torrents have a network of peers. You find peers from trackers and you keep them between sessions. The network allows you to find peers and other things. I haven't read how it's used with magnet links, but it seems like it is undefined how a fresh client finds peers. Perhaps some are baked in, or they use their home server or known trackers embedded in the client to get the first peer into the network.
When I started answering your question, I didn't realize you were asking how the magnet scheme works. I just thought you wanted to know how the parts relevant to the bittorrent protocol were generated.
The hash listed in the magnet uri is the torrent's info hash encoded in base32. The info hash is the sha1 hash of the bencoded info block of the torrent.
This python code demonstrates how it can be calculated.
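As an additional quick sketch (separate from the linked code), encoding a 20-byte SHA-1 digest in base32 gives the 32-character value seen in older magnet links; the bencoded info dictionary below is a stand-in, not a real torrent's:
import base64
import hashlib

# a stand-in bencoded info dictionary, just to have bytes to hash
info_bytes = b"d6:lengthi0e4:name7:example12:piece lengthi16384e6:pieces0:e"
digest = hashlib.sha1(info_bytes).digest()
print(digest.hex())                       # hex form, 40 characters, used by most modern links
print(base64.b32encode(digest).decode())  # base32 form: 32 characters * 5 bits = 160 bits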
I wrote a (very naive) C# implementation to test this out since I didn't have a bencoder on hand and it matches what is expected from the client.
using System;
using System.IO;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

static string CalculateInfoHash(string path)
{
    // assumes the info block is the last entry in the torrent's top-level dictionary
    byte[] data = File.ReadAllBytes(path);
    // decode as Latin-1 so character positions match byte offsets exactly
    var infokey = "e4:info";
    var offset = Encoding.GetEncoding("ISO-8859-1").GetString(data).IndexOf(infokey) + infokey.Length;
    byte[] bytes;
    using (SHA1 sha1 = SHA1.Create())
        // hash from the start of the info value to the end of the file, minus the
        // final 'e' that closes the enclosing dictionary (bencoding)
        bytes = sha1.ComputeHash(data, offset, data.Length - offset - 1);
    return String.Join("", bytes.Select(b => b.ToString("X2")));
}
As I understand it, this hash does not include any information on how to locate the tracker; the client needs to find that out through other means (the announce URL provided in the torrent). The hash is just what distinguishes one torrent from another on the tracker.
Everything related to the bittorrent protocol still revolves around the tracker. It is still the primary means of communication among the swarm. The magnet uri scheme was not designed specifically for use by bittorrent; it's used by other P2P protocols as an alternative form of addressing. Bittorrent clients adapted to accept magnet links as another way to identify torrents, so that you don't need to download .torrent files anymore. The magnet uri still needs to specify the tracker so the client can locate the swarm and participate. It can contain information about other protocols, but that is irrelevant to the bittorrent protocol. The bittorrent protocol ultimately will not work without the trackers.
The list of peers is probably populated from the torrent that upgrades the client (e.g. there's a torrent for uTorrent that upgrades it). As long as everyone's using the same client, it should be good, because you have no choice but to share the upgrade.
