BitTorrent: Why is the value of the peers field binary, not a bencoded list?

I'm trying to implement BitTorrent in C. Before writing any code, I used a web browser to send the following request (URL) to the tracker server.
You can try this URL:
http://torrent.ubuntu.com:6969/announce?info_hash=%9ea%80%ed%e7/%c4%ae%c8%de%8c%b0C%81c%fbq%3cJ%22&peer_id=M7-3-5--%eck%a8%2a%7f%e6%3ah%84%f2%9d%c5&port=43611&uploaded=0&downloaded=0&left=0&corrupt=0&key=00BA7F86&event=started&numwant=4&compact=0&no_peer_id=0
I downloaded the torrent file, named xubuntu-13.04-desktop-i386.iso, from this link; it has 9e6180ede72fc4aec8de8cb0438163fb713c4a22 as its SHA-1 value.
However, after sending the above request, I get
HTTP/1.0 200 OK
d8:completei357e10:incompletei8e8:intervali1800e5:peers24:l\262j"\310Հp\226\310\325G?\205^%!\221x \364\367\357e
But the BitTorrent specification says
peers : The value is a list of dictionaries, each with the following keys
-peer id
peer's self-selected ID, as described above for the tracker request (string)
-ip
peer's IP address (either IPv6 or IPv4) or DNS name (string)
-port
peer's port number (integer)
Why is the value of the peers field binary, not a bencoded list?
Thank you in advance.

The peers value may instead be a string whose length is a multiple of 6 bytes, with one 6-byte entry per peer. In each entry the first 4 bytes are the IP address and the last 2 bytes are the port number, both in network (big-endian) byte order.
https://wiki.theory.org/BitTorrentSpecification#Tracker_HTTP.2FHTTPS_Protocol
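To make the 6-byte layout concrete, here is a minimal sketch in Python (the question is about C, but the byte handling is the same). The function name `parse_compact_peers` and the sample addresses are made up for illustration; they are not taken from the tracker response above.

```python
import ipaddress
import struct
from typing import List, Tuple


def parse_compact_peers(blob: bytes) -> List[Tuple[str, int]]:
    """Split a compact `peers` string into (ip, port) pairs."""
    if len(blob) % 6 != 0:
        raise ValueError("compact peers string must be a multiple of 6 bytes")
    peers = []
    for offset in range(0, len(blob), 6):
        # "!4sH" = network byte order: 4 raw bytes (IPv4), unsigned 16-bit port
        raw_ip, port = struct.unpack_from("!4sH", blob, offset)
        peers.append((str(ipaddress.IPv4Address(raw_ip)), port))
    return peers


# Made-up sample: 10.0.0.1:6881 followed by 192.168.1.2:51413
sample = bytes([10, 0, 0, 1, 0x1A, 0xE1, 192, 168, 1, 2, 0xC8, 0xD5])
print(parse_compact_peers(sample))
```

By the same arithmetic, the `peers24:` string in the response above would hold 24 / 6 = 4 peers, which matches the `numwant=4` in the request.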

The protocol you refer to was used in the early days of BitTorrent. However, as some trackers became increasingly popular without being scaled out significantly in terms of capacity, the size of the tracker responses became significant. Two measures to deal with this were for clients to accept gzipped HTTP responses and compact peer responses (the latter is by far the most popular format among trackers these days). The compact peer response carries the same amount of information in a significantly smaller response. It's defined in BEP 23.
However, even though the responses are relatively small now, the TCP handshake and teardown still impose a significant cost, which is why many trackers are moving over to UDP (BEP 15).

Related

KRPC Protocol act weird in BEP-05

According to BEP-05, when you send a find_node or get_peers query, you should receive either the requested result or the K (8) good nodes closest to the target/infohash.
However, in my case, with the bootstrap node router.utorrent.com:6881, the remote returned the 8 nodes closest to self's node ID. And for a get_peers request, it always returned 8 nodes closest to self and 7 invalid peers. But if I query certain special nodes that redirect me toward the infohash, the protocol behaves normally.
weird wireshark dump
success wireshark dump
Any help would be appreciated!
You shouldn't pay too much attention to what the bootstrap nodes do as long as they allow you to populate your routing table, since that is their primary purpose.
They receive a disproportionate amount of traffic and to avoid directing undue amounts of traffic to any particular node they may deviate from the specification in a few ways that are harmless as long as only a vanishingly small fraction of the network behaves that way. There is only a single-digit number of bootstrap nodes among millions, so their behavior is negligible and should not be taken as a reference point.
It does not make sense to contact a bootstrap node via get_peers either. find_node queries would be the correct choice to populate your routing table. And it is only necessary to contact the bootstrap nodes in the relatively rare case where other mechanisms were not successful.
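For reference, "closest" in BEP 5 means smallest XOR distance between 160-bit IDs. Below is a minimal sketch of how a spec-following node would pick the K nodes to return; the helper names `xor_distance` and `closest_nodes` and the sample IDs are assumptions made for illustration.

```python
from typing import Iterable, List


def xor_distance(a: bytes, b: bytes) -> int:
    """Kademlia-style distance: XOR of the two IDs, read as an unsigned integer."""
    return int.from_bytes(a, "big") ^ int.from_bytes(b, "big")


def closest_nodes(target: bytes, node_ids: Iterable[bytes], k: int = 8) -> List[bytes]:
    """Return up to k known node IDs, nearest to the target first."""
    return sorted(node_ids, key=lambda nid: xor_distance(nid, target))[:k]


# Made-up 20-byte IDs that differ only in their last byte
target = bytes(20)
known = [bytes(19) + bytes([last]) for last in (9, 3, 200, 1)]
nearest = closest_nodes(target, known, k=2)
print([nid[-1] for nid in nearest])
```

A node that sorts by distance to its own ID (or the requester's) instead of the target would produce the kind of result described in the question.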

BitTorrent Optimistic Unchoke/Bandwidth probing

While thinking about how BitTorrent works, a few questions came to mind. I would appreciate it if somebody could share some possible answers.
Suppose a BitTorrent client gets 50 peers from the tracker and then establishes connections with 20 of them to form the peer set. Is this peer set selected randomly or based on bandwidth? (I understand that the peers which will be unchoked are selected based on their offered bandwidth.) Subsequently, how is this bandwidth determined for each connection? (A ping can give us the latency but not the bandwidth, I assume.)
The optimistic unchoke leads to the problem of free-riders in the system. Considering that an optimistic unchoke might not always result in better peers, why is it not possible to discard this policy altogether? I assume this policy helps peers with low bandwidth get their requests fulfilled. Why can't BitTorrent instead probe the bandwidth of the optimistic peer without sending data packets, and keep another connection (maybe a 5th) for low-bandwidth peers so that they don't starve? This 5th channel would transmit at only a fraction of the bandwidth of the other 4 channels. This might at least discourage free-riding.
Traditionally the peers are selected at random. Some clients may have had weak biases based on previous interactions with the peers or CIDR distance. However, there is a recent proposal (which uTorrent and libtorrent implement) that suggests a consistent but uniformly distributed peer selection/priority algorithm. For more information, see this blog post.
The unchoke algorithm is triggered every 15 seconds. The peers are then sorted by the number of bytes they sent during the last 15 seconds. The ones sending the most are then unchoked, and the rest are choked. So, the download rate is the 15-second average.
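The sorting step described above can be sketched roughly as follows. This is not any particular client's code: the function name `choose_unchoked`, the peer labels, and the default of 4 unchoke slots are assumptions for illustration, and the optimistic unchoke slot is deliberately left out.

```python
from typing import Dict, Set, Tuple


def choose_unchoked(bytes_received: Dict[str, int], slots: int = 4) -> Tuple[Set[str], Set[str]]:
    """Rank peers by bytes they sent us in the last round; unchoke the top `slots`."""
    ranked = sorted(bytes_received, key=bytes_received.get, reverse=True)
    return set(ranked[:slots]), set(ranked[slots:])


# Made-up byte counts for one 15-second round
last_round = {"A": 900_000, "B": 120_000, "C": 0, "D": 450_000, "E": 30_000, "F": 700_000}
unchoked, choked = choose_unchoked(last_round)
print(sorted(unchoked), sorted(choked))
```

Dividing each count by 15 gives the per-peer download rate the answer refers to; ranking by bytes or by rate yields the same order because the interval is the same for every peer.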
If you don't optimistically unchoke peers, there's no way for you to prove to them that you are better than the other peers in their unchoke set, and they will never unchoke you back. Without optimistic unchokes (also assuming you don't have the allow-fast extension), there is no way to start a download. When you first join, you won't have any pieces, so you can't trade for the first piece; you have to rely on being optimistically unchoked. Estimating someone's bandwidth without sending bulk data is hard and probably unreliable. Even if you got a good estimate of someone's capacity, that wouldn't necessarily mean that capacity was available to you. The current mechanism is very robust in that it doesn't need to make assumptions about the network equipment between the peers (as packet-train bandwidth estimation does), and it looks at actual data.

__connect_nocancel blocks and server gets data out of order

I have a TCP server that uses select to read data from a client over a TCP socket.
The server is slow in consuming data, while the client is much faster. My client sends 8 bytes of data, and each time it
-opens a new connection
-writes the data
-disconnects
Because of this (the server socket must accept many connections), I increased the backlog value of listen to 500.
Despite this setting, at some point I can see that
-my client blocks in a pthread function called __connect_nocancel, and this happens many times.
-after a while my server starts receiving data out of order. The first message out of order is the one where the client blocked (followed by others).
I thought that increasing the backlog might fix this issue, but this is not the case.
Can you help me? I am on Linux 2.6.32.
Cheers
AFG
The backlog parameter of listen(2) is usually capped to some value inside the OS network stack. On Linux the cap is the net.core.somaxconn setting, which defaults to 128 on kernels of that era.
The real problem though is, as @EJP is saying, that you are totally misusing TCP.
If ordering is important, your client must just keep a single connection open and write everything via that single connection. There are no two ways about this. TCP guarantees byte ordering within the stream. Nothing guarantees the ordering of server-side processing of distinct connections.
It's also considerably more efficient. At present you are exchanging about eight packets for every eight bytes, which implies an overhead of up to 160 bytes.
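A rough sketch of the single-connection approach, in Python for brevity (the same structure applies to C sockets). The function names, `MESSAGE_SIZE`, and the host/port parameters are assumptions; the question only states that each message is 8 bytes.

```python
import socket
from typing import Iterable

MESSAGE_SIZE = 8  # the question says every message is 8 bytes


def send_messages(sock: socket.socket, messages: Iterable[bytes]) -> None:
    """Write every message on the same connection, in order."""
    for msg in messages:
        if len(msg) != MESSAGE_SIZE:
            raise ValueError("each message must be exactly 8 bytes")
        sock.sendall(msg)


def recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes; a single recv() may return fewer."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection early")
        buf.extend(chunk)
    return bytes(buf)


def run_client(host: str, port: int, messages: Iterable[bytes]) -> None:
    """Connect once, send everything, then close."""
    with socket.create_connection((host, port)) as sock:
        send_messages(sock, messages)
```

On the server side, calling `recv_exact(conn, MESSAGE_SIZE)` in a loop on the accepted socket yields the messages in the order they were written, because they all travel in one byte stream.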

Bittorrent: Where do IP addresses come into picture?

I am reading about the Bittorrent protocol and couldn't find this mentioned on the Wiki page. I could understand the role of trackers and publishers but from a practical perspective, I tried contacting a tracker to give me some information and it gave me the following:
7%00%00%04%82%91%F3%CA%D5%92%08%C8%7C%B0%AE%1E4%2B%E4C:0:1
Now, the long string in the beginning is perhaps the info hash. As a next step, I did this:
http://tracker.sometracker.com/announce?info_hash=7%00%00%04%82%91%F3%CA%D5%92%08%C8%7C%B0%AE%1E4%2B%E4C
It gave me back a torrent file. So far so good. The torrent file contained this:
d8:completei0e10:downloadedi0e10:incompletei2e8:intervali1931e12:min intervali965e5:peers12:U���ٿ��ӣǣ^#^#e
I went to this site: http://en.wikipedia.org/wiki/Torrent_file but couldn't find any description (or perhaps missed it). Now, if I am the client and I get this file, where do I get the list of IP addresses that have the file?
In the bencoded string that the tracker returns, the peers string is a list of peer addresses. Each address is 6 bytes -- 4 bytes for IPv4 address and 2 bytes for the port on which the peer is listening for connections.
peers12:U���ٿ��ӣǣ^#^#e contains the addresses (i.e. IPv4 and port) for 2 peers, since the peers value is 12 bytes long.
See the bittorrent wiki spec for more info.
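To get at the peers string in the first place, the tracker response has to be bdecoded. Here is a minimal decoder sketch; the function name `bdecode` is made up, error handling is thin, and the 12 peer bytes in the sample are invented stand-ins for the unprintable bytes in the question.

```python
def bdecode(data: bytes, pos: int = 0):
    """Decode one bencoded value starting at pos; return (value, next_pos)."""
    lead = data[pos:pos + 1]
    if lead == b"i":  # integer: i<digits>e
        end = data.index(b"e", pos)
        return int(data[pos + 1:end]), end + 1
    if lead == b"l":  # list: l<items>e
        items = []
        pos += 1
        while data[pos:pos + 1] != b"e":
            item, pos = bdecode(data, pos)
            items.append(item)
        return items, pos + 1
    if lead == b"d":  # dictionary: d<key><value>...e
        result = {}
        pos += 1
        while data[pos:pos + 1] != b"e":
            key, pos = bdecode(data, pos)
            value, pos = bdecode(data, pos)
            result[key] = value
        return result, pos + 1
    if lead.isdigit():  # byte string: <length>:<bytes>
        colon = data.index(b":", pos)
        start = colon + 1
        length = int(data[pos:colon])
        return data[start:start + length], start + length
    raise ValueError(f"unexpected data at offset {pos}")


# Same keys as the response in the question, with invented peer bytes
fake_peers = bytes([10, 0, 0, 1, 0x1A, 0xE1, 10, 0, 0, 2, 0x1A, 0xE2])
response = (b"d8:completei0e10:downloadedi0e10:incompletei2e"
            b"8:intervali1931e12:min intervali965e5:peers12:" + fake_peers + b"e")
decoded, _ = bdecode(response)
print(decoded[b"interval"], len(decoded[b"peers"]) // 6)
```

Because byte strings are length-prefixed, it does not matter that the binary peers data may itself contain bytes such as `e` or digits.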

Probability of finding TCP packets with the same payload?

I had a discussion with a developer earlier today about identifying TCP packets with the same payload going out on a particular interface. He told me that the probability of finding a TCP packet that has an equal payload (even if the same data is sent out several times) is very low due to the way TCP packets are constructed at the system level. I was aware this may be the case due to the system's MTU settings (usually 1500 bytes) etc., but what sort of probability stats am I really looking at? Are there any specific protocols that would make it easier to identify matching payloads?
It is the protocol running over TCP that defines the uniqueness of the payload, not TCP itself.
For example, you might naively think that HTTP requests would all be identical when asking for a server's home page, but the referrer and user agent strings make the payloads different.
Similarly, if the response is dynamically generated, it may have a date header:
Date: Fri, 12 Sep 2008 10:44:27 GMT
So that will render the response payloads different. However, subsequent payloads may be identical, if the content is static.
Keep in mind that the actual packets will be different because of differing sequence numbers, which are supposed to be incrementing and pseudorandom.
Chris is right. More specifically, two or three pieces of information in the packet header should be different:
-the sequence number, which starts from a value intended to be unpredictable and increases with the number of bytes transmitted (the acknowledgment number similarly tracks bytes received).
-the timestamp, a field containing two timestamps (although this field is optional).
-the checksum, since both the payload and header are checksummed, including the changing sequence number.
EDIT: Sorry, my original idea was ridiculous.
You got me interested so I googled a little bit and found this. If you wanted to write your own tool, you would probably have to inspect each payload; the easiest way would probably be some sort of hash/checksum to check for identical payloads. Just make sure you are checking the payload, not the whole packet.
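A minimal sketch of that hash-based idea, assuming the payloads have already been extracted from the captured packets (the capture step is not shown, and the function name and sample payloads are made up):

```python
import hashlib
from collections import defaultdict
from typing import Iterable, List


def find_duplicate_payloads(payloads: Iterable[bytes]) -> List[List[int]]:
    """Group packet indexes whose payload bytes hash to the same digest."""
    seen = defaultdict(list)
    for index, payload in enumerate(payloads):
        # Hash only the payload, never the TCP/IP headers
        seen[hashlib.sha256(payload).hexdigest()].append(index)
    return [indexes for indexes in seen.values() if len(indexes) > 1]


# Made-up payloads: packets 0 and 2 match, as do packets 1 and 3
captured = [b"GET / HTTP/1.1\r\n", b"hello", b"GET / HTTP/1.1\r\n", b"hello", b"x"]
print(find_duplicate_payloads(captured))
```

Keying on a digest rather than the raw bytes keeps memory use bounded when payloads are large; with a cryptographic hash, accidental collisions are not a practical concern.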
As for the statistics I will have to defer to someone with greater knowledge on the workings of TCP.
Sending the same PAYLOAD is probably fairly common (particularly if you're running some sort of network service). If you mean sending out the same TCP segment (header and all) or the whole network packet (IP and up), then the probability is substantially reduced.

Resources