How to collect statistics from a bittorrent swarm? - statistics

I want to collect statistics on how a file spreads through a new BitTorrent swarm without actually downloading anything (or as little as possible). I need to know which peer has which pieces (to make file-based statistics); knowing the number of seeders and leechers, or percentages, is not enough. Later, when there are many peers, I need to download the data to determine what it is. This part can be done with a regular torrent client.
I do not plan to implement the protocol myself, so I looked at two implementations: libtorrent and KTorrent's libbtcore. Neither is capable of collecting data while not downloading: there are simply no connected peers when there is nothing to download. libtorrent is simpler, but KTorrent's code looks better commented.
I see 3 possibilities:
Use some application exactly for this. Are there any?
Modify a torrent implementation to do what I want. Is anyone familiar with them? Where to start?
Implement a small subset of the protocol. Just periodically ask the peers what they have. Is this feasible or would the program need to support almost the full protocol?
What do you recommend?

This is an old question, but perhaps this answer might be useful for others.
Use some application exactly for this. Are there any?
Not that I know of.
Modify a torrent implementation to do what I want. Is anyone familiar with them? Where to start?
I'm only familiar with the BitTornado core (that is used in e.g. ABC). It is written in Python, but it's an architectural mess.
However, you could just take any implementation and start stripping unnecessary functionality from it.
Implement a small subset of the protocol. Just periodically ask the peers what they have. Is this feasible or would the program need to support almost the full protocol?
Note that you cannot "ask" a peer what they have. The other peer informs you whenever it wants about the pieces it has (so it's push instead of pull). After the BitTorrent handshake, a peer may send a bitfield of pieces it has. Afterwards it may send HAVE messages informing you it has acquired a new piece. Also note that peers may lie about the pieces they have. Examples include superseeding peers and freeriding clients like BitThief.
If you want to implement a small subset of the protocol, you'd need, at the bare minimum, to implement the BitTorrent handshake message and preferably the extended handshake message. The latter allows you to receive (and send) uTorrent PEX messages. PEX is useful to quickly discover other peers in the swarm.
For your statistics gathering purposes, you additionally need to support the bitfield and HAVE messages.
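As a rough illustration of that minimal subset, here is a Python sketch that builds the handshake (with the extension-protocol bit set) and tracks which pieces a peer claims to have from its bitfield and HAVE messages. The function names and the idea of feeding it an already-received byte buffer are my own assumptions for the example; the socket handling, the extended handshake payload and PEX are left out.

```python
import struct

PSTR = b"BitTorrent protocol"
MSG_HAVE = 4      # payload: 4-byte big-endian piece index
MSG_BITFIELD = 5  # payload: one bit per piece, most significant bit first


def build_handshake(info_hash: bytes, peer_id: bytes, extended: bool = True) -> bytes:
    """Build the 68-byte handshake: pstrlen, pstr, 8 reserved bytes, info_hash, peer_id."""
    if len(info_hash) != 20 or len(peer_id) != 20:
        raise ValueError("info_hash and peer_id must be 20 bytes each")
    reserved = bytearray(8)
    if extended:
        reserved[5] |= 0x10  # advertise support for the extension protocol
    return bytes([len(PSTR)]) + PSTR + bytes(reserved) + info_hash + peer_id


def iter_messages(buf: bytes):
    """Yield (msg_id, payload) for every complete length-prefixed message in buf."""
    pos = 0
    while pos + 4 <= len(buf):
        (length,) = struct.unpack_from(">I", buf, pos)
        if pos + 4 + length > len(buf):
            break  # incomplete message, wait for more data
        if length > 0:  # a zero length is a keep-alive with no id
            yield buf[pos + 4], buf[pos + 5:pos + 4 + length]
        pos += 4 + length


def update_pieces(have: set, msg_id: int, payload: bytes, num_pieces: int) -> None:
    """Record the pieces a peer claims to have (remember that peers may lie)."""
    if msg_id == MSG_BITFIELD:
        for index in range(num_pieces):
            byte = index // 8
            if byte < len(payload) and payload[byte] & (0x80 >> (index % 8)):
                have.add(index)
    elif msg_id == MSG_HAVE and len(payload) == 4:
        (index,) = struct.unpack(">I", payload)
        if index < num_pieces:
            have.add(index)
```

In a real collector you would send `build_handshake(...)` over a TCP connection to each peer, skip the peer's own 68-byte handshake, and then pass everything that follows through `iter_messages` and `update_pieces`, keeping one `have` set per peer.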

Related

How do Transfer Protocols work?

Hypothetically, let's say that I wanted to study or create a transfer protocol such as HTTP, FTP or PTP. How would I go about doing so? What do I need to know about the internet and servers, and what do I need to make to be able to send and receive data through my own homemade transfer protocol?
That's a little backwards.
First you have a problem you need to solve that involves multiple machines.
Then you write software to solve it, which requires communication between those machines.
The details of that communication are called a 'protocol'.
Since the protocol is the interface between machines, it's beneficial if it is generic enough to let you swap out the software on one side or the other.
In this way, HTTP was invented to serve web pages to browsers, FTP was invented to let users transfer files, etc. The details of the protocol indicate the elements of communication required to solve the problem in the desired way.

Is there such a thing as BitTorrent passive tracking?

Hi, I want to make an application that, given a torrent file (or hash), can give the number of peers without being active (i.e. not responsible) in the process that allows the sharing of a file (for legal reasons, obviously), whether by being a "passive" (as defined previously) tracker or a BitTorrent client that counts "all-time" peers (i.e. the number of downloads for a torrent). Can it be done? I know some trackers keep track of downloads, but I don't know whether those that "seem not to" actually do as well. I'm looking for something that can track the number of unique-IP transfers from when the torrent was added to the tracking system, or something that counts completed downloads.
It's not possible to determine all peers just from a tracker. There can be multiple trackers for each torrent, and they may not store complete, fresh, or even truthful information. Additionally there's no obligation for peers to be honest with their trackers. There are also alternatives to centralized trackers, such as DHT and PEX. There's no guarantee that all peers are participating in the same DHT network. Peers might even establish disjoint PEX communities.
In short, you might make a best-effort attempt at determining the total swarm participation for a particular torrent by checking trackers and querying DHT. But to be as thorough as the technology will allow, you'd actually have to participate in the swarm with all manner of transports and protocol extensions currently in use such as uTP and encryption, and scrape each peer for further peers and download states. Of course the BitTorrent community is familiar with such attempts to scrape data, and there are a lot of security measures in place to prevent exploitation in this way. Examples include IP blocklists, and heuristics on peer behaviour.
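For the "checking trackers" part, many HTTP trackers answer a scrape request that returns counts without you joining the swarm. The Python sketch below assumes the tracker follows the usual scrape convention (the last path segment of the announce URL starts with "announce" and is swapped for "scrape") and that the reply is a bencoded dictionary with a "files" entry; the helper names are mine, and the numbers are only what that one tracker chooses to report.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def scrape_url(announce_url: str, info_hash: bytes) -> str:
    """Derive the scrape URL from an announce URL and add the info_hash parameter."""
    parts = urlsplit(announce_url)
    head, _, last = parts.path.rpartition("/")
    if not last.startswith("announce"):
        raise ValueError("tracker does not follow the scrape convention")
    path = head + "/scrape" + last[len("announce"):]
    query = parts.query + ("&" if parts.query else "")
    query += "info_hash=" + quote(info_hash, safe="")
    return urlunsplit((parts.scheme, parts.netloc, path, query, ""))


def bdecode(data: bytes, pos: int = 0):
    """Decode one bencoded value starting at pos; return (value, next_pos)."""
    lead = data[pos:pos + 1]
    if lead == b"i":  # integer: i<digits>e
        end = data.index(b"e", pos)
        return int(data[pos + 1:end]), end + 1
    if lead in (b"l", b"d"):  # list or dictionary, both terminated by "e"
        items, pos = [], pos + 1
        while data[pos:pos + 1] != b"e":
            item, pos = bdecode(data, pos)
            items.append(item)
        if lead == b"l":
            return items, pos + 1
        return dict(zip(items[0::2], items[1::2])), pos + 1
    colon = data.index(b":", pos)  # byte string: <length>:<bytes>
    start = colon + 1
    length = int(data[pos:colon])
    return data[start:start + length], start + length


def swarm_counts(response: bytes, info_hash: bytes) -> dict:
    """Extract the counts the tracker reports for one torrent."""
    decoded, _ = bdecode(response)
    stats = decoded[b"files"][info_hash]
    return {
        "seeders": stats[b"complete"],
        "leechers": stats[b"incomplete"],
        "completed_downloads": stats[b"downloaded"],
    }
```

You would fetch the URL with something like `urllib.request.urlopen(scrape_url(announce, info_hash)).read()` and pass the body to `swarm_counts`. Not every tracker supports scrape, and the "downloaded" figure only counts completions that this tracker registered, so treat it as a lower bound at best.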

Scribe with PeerSim

Can anyone provide any useful information on completing the publish/subscribe protocol "Scribe" using PeerSim? I already have the paper and the presentation, which are not very helpful for what I need. I understand the algorithm, but my problem is mainly with PeerSim.
I have some minor problems with handling sending and receiving. I have already built a peer sampling service over T-Man. T-Man always gives a list of the nodes that a peer (or rather, my own node in each cycle) knows.

Fully Decentralized P2P?

I’m looking at creating a P2P system. During initial research, I’m reading from Peer-to-Peer – Harnessing the Power of Disruptive Technologies. That book states “a fully decentralized approach to instant messaging would not work on today's Internet.” Mostly blaming firewalls and NATs. The copyright is 2001. Is this information old or still correct?
It's still largely correct. Most users are still behind firewalls or home routers that block incoming connections. Those can be opened more easily today than in 2001 (using UPnP for example, requiring little user interaction and knowledge) but most commercial end-user-targeting applications - phone (Skype, VoIP), chat (the various Messengers), remote control - are centralized solutions to circumvent firewall problems.
I would say that it is just plain wrong, both now and then. Yes, you will have many nodes that will be firewalled, however, you will also have a significant number who are not. So, if end-to-end encryption is used to protect the traffic from snooping, then you can use non-firewalled clients to act as intermediaries between two firewalled clients that want to chat.
You will need to take care, however, to spread the load around, so that a few unfirewalled clients aren't given too much load.
Skype uses a similar idea. They even allow file transfers through intermediaries, though they limit the throughput so as not to overload the middlemen.
That being said, now in 2010, it is a lot easier to punch holes in firewalls than it was in 2001, as most routers will allow you to automate the opening of ports via UPnP, so you are likely to have a larger pool of unfirewalled clients to work with.
Firewalls and NATs still commonly disrupt direct peer-to-peer communication between home-based PCs (and also between home-based PCs and corporate desktops).
They can be configured to allow particular peer-to-peer protocols, but that remains a stumbling block for most unsavvy users.
I think the original statement is no longer correct. But the field of decentralized computing is still in its infancy, with few serious contenders.
Read this interesting post on ZeroTier (thanks to #joehand): The State of NAT Traversal:
NAT is Traversable
In reading the Internet chatter on this subject I've been shocked by how many people don't really understand this, hence the reason this post was written. Lots of people think NAT is a show-stopper for peer to peer communication, but it isn't. More than 90% of NATs can be traversed, with most being traversable in reliable and deterministic ways.
At the end of the day anywhere from 4% (our numbers) to 8% (an older number from Google) of all traffic over a peer to peer network must be relayed to provide reliable service. Providing relaying for that small a number is fairly inexpensive, making reliable and scalable P2P networking that always works quite achievable.
I personally know of Dat Project, a decentralized data sharing toolkit (based on their hypercore protocol for P2P streaming).
From their Dat - Distributed Dataset Synchronization And Versioning paper:
Peer Connections
After the discovery phase, Dat should have a list of potential data sources to try and contact. Dat uses either TCP, UTP, or HTTP. UTP is designed to not take up all available bandwidth on a network (e.g. so that other people sharing wifi can still use the Internet), and is still based on UDP so works with NAT traversal techniques like UDP hole punching. HTTP is supported for compatibility with static file servers and web browser clients. Note that these are the protocols we support in the reference Dat implementation, but the Dat protocol itself is transport agnostic.
Furthermore you can use it with Bittorrent DHT. The paper also contains some references to other technologies that inspired Dat.
For implementation of peer discovery, see: discovery-channel
Then there is also IPFS, or 'The Interplanetary File System' which is currently best positioned to become a standard.
They have extensive documentation on their use of DHT and NAT traversal to achieve decentralized P2P.
The Session messenger seems to have solved the issue with a truly decentralized P2P messenger by using an incentivized mixnet to relay and store messages. It's a fork of the Signal messenger with a mixnet added in. https://getsession.org -- whitepaper: https://getsession.org/wp-content/uploads/2020/02/Session-Whitepaper.pdf
It's very old and not correct. I believe there is a product out called Tribler (news article) which enables BitTorrent to function in a fully decentralized way.
If you want to go back a few years (even before that document) you could look at Windows. Windows networking used to function in a fully decentralized way. In some cases it still does.
UPnP is also decentralized in how it determines available devices on your local network.
In order to be decentralized you need to have a way to locate other peers. This can be done proactively by scanning the network (time consuming) or by having some means of the clients announcing that they are available.
The announcements can be simple UDP packets that get broadcast every so often to the subnet, which other peers listen for. Another mechanism is broadcasting to IRC channels (most common for command and control of botnets), etc. You might even use Twitter or similar services. Use your imagination here.
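To make the UDP broadcast idea concrete, here is a minimal Python sketch. The port number, the JSON message layout and the "magic" string are all made up for the example; a real system would pick its own format and would probably authenticate announcements. It only reaches peers on the same subnet.

```python
import json
import socket

DISCOVERY_PORT = 50000    # hypothetical port, chosen for this example
MAGIC = "example-p2p-v1"  # hypothetical tag to filter out unrelated datagrams


def encode_announcement(peer_id: str, listen_port: int) -> bytes:
    """Serialize an 'I am here' message."""
    message = {"magic": MAGIC, "peer_id": peer_id, "port": listen_port}
    return json.dumps(message).encode("utf-8")


def decode_announcement(datagram: bytes):
    """Return (peer_id, port), or None if the datagram is not one of ours."""
    try:
        message = json.loads(datagram.decode("utf-8"))
    except ValueError:  # covers both invalid UTF-8 and invalid JSON
        return None
    if not isinstance(message, dict) or message.get("magic") != MAGIC:
        return None
    peer_id, port = message.get("peer_id"), message.get("port")
    if not isinstance(peer_id, str) or not isinstance(port, int):
        return None
    return peer_id, port


def announce_once(peer_id: str, listen_port: int) -> None:
    """Broadcast one announcement to the local subnet."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(encode_announcement(peer_id, listen_port),
                    ("255.255.255.255", DISCOVERY_PORT))


def listen_for_peers(timeout: float = 5.0):
    """Yield (peer_id, ip, port) until no announcement arrives for `timeout` seconds."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind(("", DISCOVERY_PORT))
        sock.settimeout(timeout)
        while True:
            try:
                datagram, (ip, _) = sock.recvfrom(1024)
            except socket.timeout:
                return
            decoded = decode_announcement(datagram)
            if decoded is not None:
                yield decoded[0], ip, decoded[1]
```

Each peer would call `announce_once` on a timer and run `listen_for_peers` in a background thread, then connect to the reported IP and port over whatever transport the application uses.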
Firewalls don't really play a part because they almost always leave open a few ports, such as 80 (http). Obviously you couldn't browse the network if that was closed. Now if the firewall is configured to only allow connections that originated from internal clients, then you'd have a little more work to do. But not much.
NATs are also not a concern, for similar reasons.

What protocol should I use for fast command/response interactions?

I need to set up a protocol for fast command/response interactions. My instinct tells me to just knock together a simple protocol with CRLF separated ascii strings like how SMTP or POP3 works, and tunnel it through SSH/SSL if I need it to be secured.
While I could just do this, I'd prefer to build on an existing technology so people could use a friendly library rather than the socket library interface the OS gives them.
I need...
Commands and responses passing structured data back and forth. (XML, S expressions, don't care.)
The ability for the server to make unscheduled notifications to the client without being polled.
Any ideas please?
If you just want request/reply, HTTP is very simple. It's already a request/response protocol. The client and server side are widely implemented in most languages. Scaling it up is well understood.
The easiest way to use it is to send commands to the server as POST requests and for the server to send back the reply in the body of the response. You could also extend HTTP with your own verbs, but that would make it more work to take advantage of caching proxies and other infrastructure that understands HTTP.
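A minimal sketch of that POST-based approach, using only the Python standard library. The JSON message shape (a "command" field, an "ok" flag) and the two commands are invented for the example and are not any standard.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def handle_command(request: dict) -> dict:
    """Dispatch on the (hypothetical) 'command' field and build the reply."""
    command = request.get("command")
    if command == "ping":
        return {"ok": True, "result": "pong"}
    if command == "echo":
        return {"ok": True, "result": request.get("payload")}
    return {"ok": False, "error": f"unknown command: {command!r}"}


class CommandHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # The command arrives as the request body; the reply goes in the response body.
        length = int(self.headers.get("Content-Length", 0))
        try:
            request = json.loads(self.rfile.read(length))
            if not isinstance(request, dict):
                raise ValueError("command must be a JSON object")
            status, reply = 200, handle_command(request)
        except ValueError:
            status, reply = 400, {"ok": False, "error": "invalid request body"}
        body = json.dumps(reply).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8080) -> None:
    """Run the command server on localhost until interrupted."""
    HTTPServer(("127.0.0.1", port), CommandHandler).serve_forever()
```

After calling `serve()`, a client can send a command with any HTTP library, or from a shell with `curl -d '{"command": "ping"}' http://127.0.0.1:8080/`. This covers request/reply only; the unscheduled server-to-client notifications need something else, as discussed next.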
If you want async notifications, then look at pub/sub protocols (Spread, XMPP, AMQP, JMS implementations or commercial pub/sub message brokers like TibcoRV, Tibco EMS or Websphere MQ). The protocol or implementation to pick depends on the reliability, latency and throughput needs of the system you're building. For example, is it ok for notifications to be dropped when the network is congested? What happens to notifications when a client is off-line -- do they get discarded or queued up for when the client reconnects.
AMQP sounds promising. Alternatively, I think XMPP supports much of what you want, though with quite a bit of overhead.
That said, depending on what you're trying to accomplish, a simple ad hoc protocol might be easier.
How about something like SNMP? I'm not sure if it fits exactly with the model your app uses, but it supports both async notify and pull (i.e., TRAP and GET).
That's a great question with a huge number of variables to consider, and the question only mentioned a few of them: packet format, asynchronous vs. synchronous messaging, and security. There are many, many others one could think about. I suggest going through a description of the 7-layer protocol stack (OSI/ISO) and asking yourself what you need at those layers, and whether you want to build that layer or get it from somewhere else. (You seem mostly interested in layers 6 and 7, but also mentioned bits of lower layers.)
Think also about whether this is in a safety-critical application or part of a system with formal V&V. Really good, trustworthy communication systems are not easy to design; also, an "underpowered" protocol can put a lot of coding burden on the application to do error recovery.
Finally, I would suggest looking at how other applications similar to yours do the job (check open source, read books, etc.) Also useful is the U.S. Patent Office database, etc; one can get great ideas just from reading the description of the communication problem they were trying to solve.
