Bittorrent: Where do IP addresses come into picture?

I am reading about the BitTorrent protocol and couldn't find this covered on the Wikipedia page. I understand the role of trackers and publishers, but to see how it works in practice I contacted a tracker, and it gave me the following:
7%00%00%04%82%91%F3%CA%D5%92%08%C8%7C%B0%AE%1E4%2B%E4C:0:1
Now, the long string in the beginning is perhaps the info hash. As a next step, I did this:
http://tracker.sometracker.com/announce?info_hash=7%00%00%04%82%91%F3%CA%D5%92%08%C8%7C%B0%AE%1E4%2B%E4C
It gave me back a torrent file. So far so good. The torrent file contained this:
d8:completei0e10:downloadedi0e10:incompletei2e8:intervali1931e12:min intervali965e5:peers12:U���ٿ��ӣǣ^#^#e
I went to this site: http://en.wikipedia.org/wiki/Torrent_file but couldn't find any description (or perhaps missed it). Now, if I am the client and I get this file, where do I get the list of IP addresses that have the file?

In the bencoded string that the tracker returns, the peers string is a list of peer addresses. Each address is 6 bytes: 4 bytes for the IPv4 address and 2 bytes for the port on which the peer is listening for connections, both in network (big-endian) byte order.
peers12:U���ٿ��ӣǣ^#^#e contains the addresses (i.e. IPv4 and port) for 2 peers, since the peers value is 12 bytes long.
See the bittorrent wiki spec for more info.
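As a rough illustration of that layout, here is a minimal Python sketch that splits a compact peers string into (IP, port) pairs. The function name parse_compact_peers and the sample bytes are made up for this example; they are not the bytes from the tracker response above, which did not survive being pasted as text.

```python
import ipaddress
import struct

PEER_LEN = 6  # 4-byte IPv4 address + 2-byte port


def parse_compact_peers(peers: bytes) -> list[tuple[str, int]]:
    """Split a compact peers string into (ip, port) tuples."""
    if len(peers) % PEER_LEN != 0:
        raise ValueError("compact peers string must be a multiple of 6 bytes")
    result = []
    for offset in range(0, len(peers), PEER_LEN):
        ip = str(ipaddress.IPv4Address(peers[offset:offset + 4]))
        # "!H" = unsigned 16-bit integer in network (big-endian) byte order
        (port,) = struct.unpack_from("!H", peers, offset + 4)
        result.append((ip, port))
    return result


# Invented 12-byte sample: two peers, like the 12-byte value in the question.
sample = bytes([192, 168, 1, 10, 0x1A, 0xE1, 10, 0, 0, 1, 0xC8, 0xD5])
print(parse_compact_peers(sample))
```

A 12-byte value therefore always yields exactly two entries.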

Related

Why might Wireshark and NodeJS disagree about a packet's contents?

I'm working with raw-socket (a node module for sending raw data out on the network) and playing with their Ping example.
I have Wireshark set up to monitor traffic. I can see my ICMP packet go out, and a response comes back.
Here's where things get strange.
Wireshark shows the following packet:
IP: 4500003c69ea00004001e2fec0a85647c0a85640
ICMP: 00004b5200010a096162636465666768696a6b6c6d6e6f7071727374757677616263646566676869
However, the node event handler that fires when data comes in is showing:
IP: 4500280069ea00004001e2fec0a85647c0a85640
ICMP: 00004b5200010a096162636465666768696a6b6c6d6e6f7071727374757677616263646566676869
The ICMP components match. However, bytes 0x02 and 0x03 (the Length bytes) differ.
Wireshark shows 0x003c or 60 bytes (as expected).
Node shows 0x2800 or 10kB... which is not what is expected.
Notably, the header checksum (bytes 0x0a and 0x0b) is the same in each case, although it's only valid for the Wireshark packet.
So, here's the question: what might lead to this discrepancy? I'm inclined to believe Wireshark is correct since 60 bytes is the right size for an ICMP reply, but why is Node wrong here?
OSX note
The docs for this module point out that, on OSX, it will try to use SOCK_DGRAM if SOCK_RAW is not permitted. I have tried this with that function disabled and using sudo and got the same responses as before.
Github issue
It looks like https://github.com/nospaceships/node-raw-socket/issues/60 is open for this very issue, but it remains unclear if this is a code bug or a usage problem...

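This is due to a FreeBSD bug (feature?): for raw sockets, the kernel subtracts the length of the IP header from the IP total-length field and also flips the field to host byte order before handing the packet to user space. Here 0x003c (60) minus the 20-byte header is 40, or 0x0028, which a little-endian host stores as the bytes 28 00. OS X appears to inherit this behaviour from its BSD-derived network stack.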
https://cseweb.ucsd.edu//~braghava/notes/freebsd-sockets.txt
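To illustrate, the following Python sketch undoes that adjustment on the header bytes Node reported in the question and checks the result against the Wireshark capture. It assumes a little-endian host and that only the total-length field needs repairing (true for this sample, where the fragment-offset field is zero). restore_wire_header and ones_complement_sum are hypothetical helper names, not part of raw-socket.

```python
import struct


def ones_complement_sum(data: bytes) -> int:
    """16-bit one's-complement sum; 0xFFFF over a header means a valid checksum."""
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total


def restore_wire_header(header: bytes) -> bytes:
    """Rebuild the on-the-wire total length from what the raw socket delivered."""
    ihl = (header[0] & 0x0F) * 4                        # header length in bytes
    (host_len,) = struct.unpack_from("<H", header, 2)   # assumption: little-endian host
    fixed = bytearray(header)
    struct.pack_into("!H", fixed, 2, host_len + ihl)    # back to network byte order
    return bytes(fixed)


node_header = bytes.fromhex("4500280069ea00004001e2fec0a85647c0a85640")
wire_header = bytes.fromhex("4500003c69ea00004001e2fec0a85647c0a85640")

restored = restore_wire_header(node_header)
print(restored.hex())
print(hex(ones_complement_sum(restored)))
```

With the length restored, the header matches the Wireshark bytes and the checksum verifies, which is consistent with the checksum having been computed over the original wire values.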

What are the most recent bittorrent DHT implementation recommendations?

I'm working on implementing yet another BitTorrent client and am currently struggling with the DHT. It is implemented according to this specification, http://www.bittorrent.org/beps/bep_0005.html, but when I started debugging it I noticed that other nodes' responses on the network vary.
For example, find_node is supposed to return either the target node's info or the 8 closest nodes. Most of the nodes reply with 34 closest nodes, and usually only 1 to 3 of those 34 successfully reply to a subsequent ping request.
Is there another document with better implementation recommendations? Maybe it has already been shown that a 15-minute interval for changing a node's state to questionable is not efficient and I should use 10 minutes or some other number? Where can I find the best up-to-date suggestions?
There is another strange thing. Bootstrap nodes like router.bittorrent.com reply with even more closest nodes, and the "nodes" BDictionary property buffer length is usually not divisible by 6 (compact node info: 4 bytes for the IP and 2 for the port). For now, I simply truncate the buffer to the nearest length divisible by 6, but all of that is strange. Does anybody know why that might happen?

The spec says (emphasis mine):
When a node receives a find_node query, it should respond with a key "nodes" and value of a string containing the compact node info for [...]
Further down:
Contact information for nodes is encoded as a 26-byte string. Also known as "Compact node info" the 20-byte Node ID in network byte order has the compact IP-address/port info concatenated to the end.
Additionally you should read the original Kademlia paper since the bittorrent BEP builds on the concepts described therein and omits deeper explanations of those concepts.
You might also want to read about a few extensions that are now more or less de facto standard for most implementations: http://libtorrent.org/dht_extensions.html
And read the other DHT-related BEPs, some are fairly widely adopted and modify/clarify BEP-5-specified behavior, but generally in a backward-compatible way.
Regarding "For example, find_node is supposed to return either target node info or 8 closest nodes":
Nodes return a variable number of entries. It could be more than 8, or fewer.
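As a sketch of the 26-byte layout quoted above, the following Python splits a "nodes" value into (node ID, IP, port) entries. The name parse_compact_nodes and the sample data are invented for illustration. Note that 8 entries occupy 8 * 26 = 208 bytes, which splits into 34 six-byte chunks with 4 bytes left over; that may be where the "34 nodes" and the length that is not divisible by 6 come from.

```python
import ipaddress
import struct

NODE_INFO_LEN = 26  # 20-byte node ID + 4-byte IPv4 address + 2-byte port


def parse_compact_nodes(nodes: bytes) -> list[tuple[bytes, str, int]]:
    """Split a compact node info string into (node_id, ip, port) tuples."""
    if len(nodes) % NODE_INFO_LEN != 0:
        raise ValueError("compact node info must be a multiple of 26 bytes")
    result = []
    for offset in range(0, len(nodes), NODE_INFO_LEN):
        node_id = nodes[offset:offset + 20]
        ip = str(ipaddress.IPv4Address(nodes[offset + 20:offset + 24]))
        (port,) = struct.unpack_from("!H", nodes, offset + 24)  # big-endian port
        result.append((node_id, ip, port))
    return result


# Invented sample: one node with ID 00 01 02 ... 13 at 10.0.0.1:6881.
sample_node = bytes(range(20)) + bytes([10, 0, 0, 1]) + (6881).to_bytes(2, "big")
print(parse_compact_nodes(sample_node * 8)[0])
```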

How to reconstruct files transferred using uTP?

I have read in an article that BitTorrent uses the uTorrent Transport Protocol. Also, as far as I understand, if I am downloading a file using BitTorrent, the different pieces can come from different peers. All these packets have the same connection-id. But how can I work out the order in which these packets arrived?
For example, let P1, P2 and P3 be the peers from which I can get my file, and D1 be my system. Suppose the first portion of the file came from P2, the second from P1 and the third from P3. Is there any way to find out which part came from which system so that I can reconstruct the file from the captured packets?
Thank You.

The order of the individual uTP packets doesn't matter. The uTP protocol takes care of reconstructing the order of the transported stream.
It's not necessary to know which system a torrent 'piece' message originated from in order to reconstruct a file. Using the data in the torrent's metainfo together with the 'piece' messages of the BitTorrent peer protocol, it's possible to create the intended files within a torrent.
To avoid confusion, I think you will benefit from knowing that uTP is a level of abstraction below the peer protocol in use with each peer.
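To make that concrete: each 'piece' message in the peer protocol carries a piece index, a byte offset within that piece (begin) and the block data, so a block's position in the torrent's byte stream follows from the message itself rather than from the sender or the arrival order. Below is a minimal Python sketch, assuming a single-file torrent. The helper names assemble and verify are hypothetical, and piece_length and piece_hashes stand for the metainfo's 'piece length' and 'pieces' values.

```python
import hashlib


def assemble(blocks, piece_length: int, total_length: int) -> bytes:
    """Place (index, begin, data) blocks, received in any order, at their offsets."""
    content = bytearray(total_length)
    for index, begin, data in blocks:
        offset = index * piece_length + begin
        if offset + len(data) > total_length:
            raise ValueError("block extends past the end of the torrent")
        content[offset:offset + len(data)] = data
    return bytes(content)


def verify(content: bytes, piece_length: int, piece_hashes: bytes) -> bool:
    """Check every piece against its 20-byte SHA-1 from the metainfo."""
    for n in range(len(piece_hashes) // 20):
        piece = content[n * piece_length:(n + 1) * piece_length]
        if hashlib.sha1(piece).digest() != piece_hashes[n * 20:(n + 1) * 20]:
            return False
    return True


# Invented example: a 10-byte "file" with 4-byte pieces, blocks arriving out of
# order as if from peers P2, P1 and P3.
original = b"abcdefghij"
hashes = b"".join(hashlib.sha1(original[i:i + 4]).digest() for i in range(0, 10, 4))
blocks = [(1, 0, b"efgh"), (0, 2, b"cd"), (2, 0, b"ij"), (0, 0, b"ab")]
rebuilt = assemble(blocks, piece_length=4, total_length=10)
print(rebuilt, verify(rebuilt, 4, hashes))
```

In a capture, the uTP layer would first have to be reassembled into each peer's byte stream before the 'piece' messages can be read out of it.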

BitTorrent: Why is the value of the peers field binary, not a bencoded list?

I'm trying to implement BitTorrent in C. First of all, before writing any code, I tried using a web browser to send the following request (URL) to the tracker server.
You may try this URL:
http://torrent.ubuntu.com:6969/announce?info_hash=%9ea%80%ed%e7/%c4%ae%c8%de%8c%b0C%81c%fbq%3cJ%22&peer_id=M7-3-5--%eck%a8%2a%7f%e6%3ah%84%f2%9d%c5&port=43611&uploaded=0&downloaded=0&left=0&corrupt=0&key=00BA7F86&event=started&numwant=4&compact=0&no_peer_id=0
I have downloaded the torrent file from this link which is named xubuntu-13.04-desktop-i386.iso and has 9e6180ede72fc4aec8de8cb0438163fb713c4a22 as SHA-1 value.
However, after sending above request, I get
HTTP/1.0 200 OK
d8:completei357e10:incompletei8e8:intervali1800e5:peers24:l\262j"\310Հp\226\310\325G?\205^%!\221x \364\367\357e
But the BitTorrent specification says:
peers: The value is a list of dictionaries, each with the following keys:
- peer id: peer's self-selected ID, as described above for the tracker request (string)
- ip: peer's IP address (either IPv6 or IPv4) or DNS name (string)
- port: peer's port number (integer)
Why is the value of the peers field binary rather than a bencoded list?
Thank you in advance.

The peers value may instead be a string whose length is a multiple of 6 bytes. In each 6-byte entry, the first 4 bytes are the IP address and the last 2 bytes are the port number, all in network (big-endian) byte order.
https://wiki.theory.org/BitTorrentSpecification#Tracker_HTTP.2FHTTPS_Protocol
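One way a client might cope with both response styles is to decode the bencoded body and look at the type of the peers value: a byte string means the compact form, a list means the dictionary form. The sketch below is a minimal, unvalidated bencode decoder written for illustration only. bdecode and peers_format are hypothetical names, and the sample responses use invented peer data.

```python
def bdecode(data: bytes, pos: int = 0):
    """Decode one bencoded value starting at pos; return (value, next_pos)."""
    lead = data[pos:pos + 1]
    if lead == b"i":                      # integer: i<digits>e
        end = data.index(b"e", pos)
        return int(data[pos + 1:end]), end + 1
    if lead == b"l":                      # list: l<items>e
        pos += 1
        items = []
        while data[pos:pos + 1] != b"e":
            item, pos = bdecode(data, pos)
            items.append(item)
        return items, pos + 1
    if lead == b"d":                      # dictionary: d<key><value>...e
        pos += 1
        result = {}
        while data[pos:pos + 1] != b"e":
            key, pos = bdecode(data, pos)
            result[key], pos = bdecode(data, pos)
        return result, pos + 1
    if lead.isdigit():                    # byte string: <length>:<bytes>
        colon = data.index(b":", pos)
        start = colon + 1
        length = int(data[pos:colon])
        return data[start:start + length], start + length
    raise ValueError(f"invalid bencoded data at offset {pos}")


def peers_format(response: bytes) -> str:
    """Report which peers representation a tracker response uses."""
    decoded, _ = bdecode(response)
    return "compact" if isinstance(decoded[b"peers"], bytes) else "dictionary"


compact_response = (b"d8:completei357e10:incompletei8e8:intervali1800e5:peers6:"
                    + bytes([127, 0, 0, 1, 0x1A, 0xE1]) + b"e")
dict_response = b"d5:peersld2:ip9:127.0.0.14:porti6881eeee"
print(peers_format(compact_response), peers_format(dict_response))
```

Checking the decoded type rather than trusting the compact=0 request parameter matters here, since the tracker in the question answered in compact form anyway.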

The protocol you refer to was used in the early days of BitTorrent. However, as some trackers became increasingly popular without being scaled out significantly in terms of capacity, the size of the tracker responses became significant. One measure to deal with this was for clients to accept gzipped HTTP responses and compact peer responses (by far the most popular format among trackers these days). The compact peer response carries the same amount of information in a significantly smaller response. It's defined in BEP 23.
However, even though the responses are relatively small now, the TCP handshake and teardown still impose a significant cost. This is why many trackers are moving over to UDP (BEP 15).

Probability of finding TCP packets with the same payload?

I had a discussion with a developer earlier today re identifying TCP packets going out on a particular interface with the same payload. He told me that the probability of finding a TCP packet that has an equal payload (even if the same data is sent out several times) is very low due to the way TCP packets are constructed at system level. I was aware this may be the case due to the system's MTU settings (usually 1500 bytes) etc., but what sort of probability stats am I really looking at? Are there any specific protocols that would make it easier identifying matching payloads?

It is the protocol running over TCP that determines how unique the payload is, not TCP itself.
For example, you might naively think that HTTP requests would all be identical when asking for a server's home page, but the referrer and user agent strings make the payloads different.
Similarly, if the response is dynamically generated, it may have a date header:
Date: Fri, 12 Sep 2008 10:44:27 GMT
So that will render the response payloads different. However, subsequent payloads may be identical, if the content is static.
Keep in mind that the actual packets will still differ because of their sequence numbers, which start from a pseudorandom initial value and increase as data is sent.

Chris is right. More specifically, two or three pieces of information in the packet header should be different:
- the sequence number, which starts from a value that is intended to be unpredictable and increases with the number of bytes transmitted;
- the timestamp, a field containing two timestamps (although this field is optional);
- the checksum, since both the payload and header are checksummed, including the changing sequence number.

EDIT: Sorry, my original idea was ridiculous.
You got me interested, so I googled a little bit and found this. If you wanted to write your own tool, you would probably have to inspect each payload; the easiest way would probably be some sort of hash/checksum to check for identical payloads. Just make sure you are checking the payload, not the whole packet.
As for the statistics I will have to defer to someone with greater knowledge on the workings of TCP.
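A rough Python sketch of that hashing idea follows. It assumes the TCP payloads have already been extracted from a capture by some other tool and are available as byte strings. find_duplicate_payloads is a hypothetical name, and skipping empty payloads (such as bare ACKs) is a choice made for this example, not something the answer above prescribes.

```python
import hashlib
from collections import defaultdict


def find_duplicate_payloads(payloads: list[bytes]) -> list[list[int]]:
    """Return groups of indices whose payloads are byte-for-byte identical."""
    seen = defaultdict(list)
    for index, payload in enumerate(payloads):
        if not payload:          # ignore segments that carry no data
            continue
        # Hash the payload only, never the headers, which differ per segment.
        seen[hashlib.sha256(payload).digest()].append(index)
    return [indices for indices in seen.values() if len(indices) > 1]


# Invented sample payloads standing in for segments from a capture.
captured = [
    b"GET / HTTP/1.1\r\n\r\n",
    b"",
    b"something else",
    b"GET / HTTP/1.1\r\n\r\n",
    b"",
]
print(find_duplicate_payloads(captured))
```

Because one application message can be split across segments differently each time it is sent, comparing reassembled streams rather than individual segments may find more matches.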

Sending the same PAYLOAD is probably fairly common (particularly if you're running some sort of network service). If you mean sending out the same TCP segment (header and all) or the whole network packet (IP and up), then the probability is substantially reduced.
