What are some good settings for seeding a ton of torrents? (>10000)

I'm running into a lot of trouble when trying to seed a lot of torrents (>10k) with libtorrent.
The problems include:
Choking my network connection
Tracker requests timing out (libtorrent tracker error)
When using auto-manage, torrents go from checking to seeding very slowly, even when my active_seeds is set to unlimited
I used to let them be auto-managed, but I'd find that it makes nearly all of them unavailable.
Here are my current settings:
sessionSettings.setActiveDownloads(5);      // at most 5 torrents downloading at once
sessionSettings.setActiveLimit(-1);         // no overall cap on active torrents
sessionSettings.setActiveSeeds(-1);         // no cap on actively seeding torrents
sessionSettings.setActiveDHTLimit(5);       // torrents announced to the DHT at once
sessionSettings.setPeerConnectTimeout(25);  // seconds before a peer connect attempt times out
sessionSettings.announceDoubleNAT(true);
sessionSettings.setUploadRateLimit(0);      // 0 = unlimited
sessionSettings.setDownloadRateLimit(0);    // 0 = unlimited
sessionSettings.setHalfOpenLimit(5);        // max half-open (in-progress) TCP connection attempts
sessionSettings.useReadCache(false);
sessionSettings.setMaxPeerlistSize(500);    // cap on peers tracked per torrent
My current method is to loop over all my 10k+ torrents and run torrent.resume(). When using auto-manage, this only starts ~50 of the torrents, and the rest start at a rate of about 1 torrent per 10 minutes, which won't work. When not using auto-manage, it chokes my connection.
BUT when I resume() only 30 of them, they all seem to seed correctly, so my next plan is to try to resume() them in groups, either with a time delay or after they've received a tracker_reply.
I tried to garner what I could from this, but don't know what my settings should be specifically:
http://blog.libtorrent.org/2012/01/seeding-a-million-torrents/
I'd really appreciate someone sharing their settings for seeding thousands of torrents.

When not using automanage, it chokes my connection.
Since you say it can run either on a hosted server or on a domestic internet connection, you won't have much choice but to throttle torrent startups. Domestic internet connections are generally behind consumer-grade routers and possibly CGNAT, both of which have fairly small NAT tables that will eventually choke on concurrently established TCP connections (peer-to-peer connections, tracker announces) or UDP pseudo-connections (UDP trackers, µTP, DHT).
So to run many torrents at once you will have to limit all active maintenance traffic of that kind, so that the torrents only listen passively for incoming connections.
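For illustration, a batched startup along the lines the question proposes might look like the sketch below. This is a rough sketch, not a tested recipe: TorrentHandle is a stand-in for the jlibtorrent handle type, and BATCH_SIZE and BATCH_DELAY_SECONDS are assumptions to tune.

import java.util.List;
import java.util.concurrent.TimeUnit;

interface TorrentHandle { void resume(); } // stand-in for jlibtorrent's TorrentHandle

public class BatchedSeeder {
    static final int BATCH_SIZE = 30;           // the batch size that behaved well above
    static final long BATCH_DELAY_SECONDS = 60; // assumed settle time between batches

    public static void startInBatches(List<TorrentHandle> torrents) throws InterruptedException {
        for (int i = 0; i < torrents.size(); i += BATCH_SIZE) {
            int end = Math.min(i + BATCH_SIZE, torrents.size());
            for (TorrentHandle t : torrents.subList(i, end)) {
                t.resume(); // start checking/announcing for this batch only
            }
            // Pause so this batch's tracker announces and NAT-table entries
            // settle before the next batch starts.
            TimeUnit.SECONDS.sleep(BATCH_DELAY_SECONDS);
        }
    }
}

Waiting for a tracker_reply alert instead of a fixed delay, as the question suggests, would achieve the same thing adaptively.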

Related

How many long-lived concurrent TCP sessions are considered reasonable?

I am developing a TCP server in Linux which initially can handle thousands of concurrent clients, which are intended to be long-lived. However, after starting to implement some functionality, I made a thread pool for the calls which are blocking and should be done separately, like database or disk access.
After some tests, under high load requesting "many" asynchronous functions, my server starts to lag due to many tasks being enqueued, as they arrive faster than they can be processed. These tasks are resolved in nanoseconds, but there are thousands of them. I do understand this is totally normal.
I could of course scale out behind a load balancer or buy better servers with more cores. However, in practice and as a standard in the industry, how many concurrent long-lived TCP sessions are considered a "good" number for a server like the one I'm describing? How can I tell that the number of concurrent connections I handle is "good enough"?
Unfortunately there's no magic number to answer your question, but I have some considerations to help you find your number:
First of all, every operating system has its own maximum number of simultaneous connections, because port numbers are finite. So check that you're not exceeding this number, or else every new connection will be refused by your server.
In order to identify how many simultaneous connections are okay for you, you must establish a maximum response time for your service.
Keep in mind that even with simultaneous connections, multicore CPUs, etc., the responses all leave through the same network link and can get bottlenecked there. Thus I advise you to run a load test and a stress test over your architecture in order to find your acceptable latency limit.
TL;DR: There's no magic answer; you should run a load test and a stress test to find it.
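For illustration, a crude probe of that kind could look like the sketch below (the host, port and payload are placeholders, and a real test should use a proper tool such as JMeter): open N connections concurrently and watch how latency degrades as N grows.

import java.io.*;
import java.net.Socket;
import java.util.concurrent.*;

// Crude concurrent-connection latency probe: open N connections, send one
// request each, and report the worst-case response time.
public class LatencyProbe {
    public static void main(String[] args) throws Exception {
        String host = "localhost"; // placeholder target
        int port = 8080;           // placeholder port
        int clients = 500;         // sweep this upward until latency degrades

        ExecutorService pool = Executors.newFixedThreadPool(clients);
        CompletionService<Long> results = new ExecutorCompletionService<>(pool);
        for (int i = 0; i < clients; i++) {
            results.submit(() -> {
                long start = System.nanoTime();
                try (Socket s = new Socket(host, port)) {
                    s.getOutputStream().write("PING\n".getBytes()); // placeholder request
                    s.getInputStream().read(); // wait for the first response byte
                }
                return (System.nanoTime() - start) / 1_000_000; // elapsed ms
            });
        }
        long worst = 0;
        for (int i = 0; i < clients; i++) worst = Math.max(worst, results.take().get());
        System.out.println("worst-case latency: " + worst + " ms");
        pool.shutdown();
    }
}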

Expected performance with getstream.io

The getstream.io documentation says that one should expect to retrieve a feed in approximately 60ms. When I retrieve my feeds, they contain a field named 'duration' which I take to be the server-side processing time. This value is steadily around 10-40ms, with an average around 15ms.
The problem is, I seldom get my feeds in less than 150ms, and the average time is around 200-250ms, sometimes up to 300-400ms. This is the time for getting the feed alone, no enrichment etc., and I have verified with tcpdump that the network round trip is low (around 25ms) and that the time is actually spent waiting for the server to respond.
I've tried moving my application around (eu-west and eu-central) but that doesn't seem to affect things much (again, the network round trip is steadily around 25ms).
My question is: should I really expect 60ms and continue investigating, or is 200-400ms normal? On the getstream.io site it is explained that developer accounts receive "Low Priority Processing" - what does this mean in practice? How much difference could I expect with another plan?
I'm using the Node.js low-level API.
Stream's APIs use SSL to encrypt traffic. Unfortunately SSL introduces additional network I/O. Usually you need to pay for the increased latency only once, because Stream's HTTP APIs support HTTP persistent connections (aka keep-alive).
Here's a Wireshark screenshot of the TCP traffic of 2 sequential API requests with keep alive disabled client side:
The 4 lines in red highlight that the TCP connection is getting closed each time. Another interesting thing is that the handshaking takes almost 100ms and it's done twice (the first bunch of lines).
After some investigation, it turns out that the library used to make API requests to Stream's APIs (request) does not have keep-alive enabled by default. That change will be part of the library soon and is already available on a development branch.
Here's a screenshot of the same two requests with keep-alive enabled (using the code from that branch):
This time there is no connection reset anymore, and the second HTTP request does not do an SSL handshake.
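The same effect is easy to demonstrate in Java, where java.net.http.HttpClient keeps connections alive by default (an illustrative sketch; the URL is a placeholder): time two sequential requests and the second one skips the TCP and TLS handshakes.

import java.net.URI;
import java.net.http.*;

// Demonstrates the keep-alive effect described above: the client reuses the
// underlying TLS connection, so the second request avoids handshaking.
public class KeepAliveDemo {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient(); // pools connections by default
        HttpRequest req = HttpRequest.newBuilder(
                URI.create("https://example.com/")).build(); // placeholder URL
        for (int i = 0; i < 2; i++) {
            long start = System.nanoTime();
            client.send(req, HttpResponse.BodyHandlers.discarding());
            System.out.printf("request %d: %d ms%n", i + 1,
                    (System.nanoTime() - start) / 1_000_000);
        }
    }
}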

Weird Tomcat outage, possibly related to maxConnections

In my company we experienced a serious problem today: our production server went down. Most people accessing our software via a browser were unable to get a connection; however, people who had already been using the software were able to continue using it. Even our hot standby server was unable to communicate with the production server, which it does using HTTP, without even going out to the broader internet. The whole time the server was accessible via ping and ssh, and in fact was quite underloaded - it normally runs at 5% CPU load and it was even lower at this time. We do almost no disk I/O.
A few days after the problem started we have a new variation: port 443 (HTTPS) is responding but port 80 stopped responding. The server load is very low. Immediately after restarting tomcat, port 80 started responding again.
We're using tomcat7, with maxThreads="200", and using maxConnections=10000. We serve all data out of main memory, so each HTTP request completes very quickly, but we have a large number of users doing very simple interactions (this is high school subject selection). But it seems very unlikely we would have 10,000 users all with their browser open on our page at the same time.
My question has several parts:
Is it likely that the "maxConnections" parameter is the cause of our woes?
Is there any reason not to set "maxConnections" to a ridiculously high value e.g. 100,000? (i.e. what's the cost of doing so?)
Does Tomcat output a warning message anywhere once it hits the "maxConnections" limit? (We didn't notice anything.)
Is it possible there's an OS limit we're hitting? We're using CentOS 6.4 (Linux) and "ulimit -f" says "unlimited". (Do firewalls understand the concept of TCP/IP connections? Could there be a limit elsewhere?)
What happens when Tomcat hits the "maxConnections" limit? Does it try to close down some inactive connections? If not, why not? I don't like the idea that our server can be held to ransom by people leaving their browsers open on it, sending keep-alives to keep the connection open.
But the main question is, "How do we fix our server?"
More info as requested by Stefan and Sharpy:
Our clients communicate directly with this server
TCP connections were in some cases immediately refused and in other cases timed out
The problem is evident even when connecting my browser to the server within the network, and the hot standby server - also on the same network - is unable to exchange its database replication messages, which normally happen over HTTP.
IPTables - yes, IPTables6 - I don't think so. Anyway, there's nothing between my browser and the server when I test after noticing the problem.
More info:
It really looked like we had solved the problem when we realised we were using the default Tomcat7 setting of BIO, which has one thread per connection, and we had maxThreads=200. In fact 'netstat -an' showed about 297 connections, which matches 200 + queue of 100. So we changed this to NIO and restarted tomcat. Unfortunately the same problem occurred the following day. It's possible we misconfigured the server.xml.
The server.xml and extract from catalina.out is available here:
https://www.dropbox.com/sh/sxgd0fbzyvuldy7/AACZWoBKXNKfXjsSmkgkVgW_a?dl=0
More info:
I did a load test. I'm able to create 500 connections from my development laptop, and do an HTTP GET 3 times on each, without any problem. Unless my load test is invalid (the Java class is also in the above link).
It's hard to tell for sure without hands-on debugging but one of the first things I would check would be the file descriptor limit (that's ulimit -n). TCP connections consume file descriptors, and depending on which implementation is in use, nio connections that do polling using SelectableChannel may eat several file descriptors per open socket.
To check if this is the cause:
Find Tomcat PIDs using ps
Check the ulimit the process runs with: cat /proc/<PID>/limits | fgrep 'open files'
Check how many descriptors are actually in use: ls /proc/<PID>/fd | wc -l
If the number of used descriptors is significantly lower than the limit, something else is the cause of your problem. But if it is equal or very close to the limit, it's this limit which is causing issues. In this case you should increase the limit in /etc/security/limits.conf for the user with whose account Tomcat is running and restart the process from a newly opened shell, check using /proc/<PID>/limits if the new limit is actually used, and see if Tomcat's behavior is improved.
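For example, assuming Tomcat runs under a user named tomcat (an assumption; substitute your actual service account), the limits.conf entries would look something like this, with an illustrative value:

# /etc/security/limits.conf - raise the open-files limit for the Tomcat user
tomcat  soft  nofile  65536
tomcat  hard  nofile  65536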
While I don't have a direct answer to solve your problem, I'd like to offer my methods to find what's wrong.
Intuitively there are 3 assumptions:
If your clients hold their connections and never release them, it is quite possible your server hits the max connection limit even when there is no communication.
A non-responding state can also be reached in various ways, such as bugs in the server-side code.
Hardware problems should not be ruled out either.
To locate the cause of this problem, you should try to replay the scenario in a testing environment. Perform more comprehensive tests and record more detailed logs, including but not limited to:
Unit tests, esp. for logic blocks using transactions, threading and synchronization.
Stress-oriented tests. Try to simulate all the user behaviors you can come up with, and their combinations, and test them in a massive batch mode.
More specific logging. Trace client behaviors and analyze exactly what happened before the server stopped responding.
Swapping in a different server machine to see if it still happens.
The short answer:
Use the NIO connector instead of the default BIO connector
Set "maxConnections" to something suitable e.g. 10,000
Encourage users to use HTTPS so that intermediate proxy servers can't turn 100 page requests into 100 TCP connections.
Check for threads hanging due to deadlock problems, e.g. with a stack dump (kill -3)
(If applicable and if you're not already doing this, write your client app to use the one connection for multiple page requests).
The long answer:
We were using the BIO connector instead of NIO connector. The difference between the two is that BIO is "one thread per connection" and NIO is "one thread can service many connections". So increasing "maxConnections" was irrelevant if we didn't also increase "maxThreads", which we didn't, because we didn't understand the BIO/NIO difference.
To change it to NIO, put this in the <Connector> element in server.xml:
protocol="org.apache.coyote.http11.Http11NioProtocol"
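For context, a complete Connector element might then look something like this (the ports and the maxThreads/maxConnections values are illustrative, not recommendations):

<!-- server.xml: NIO connector - one thread can service many connections -->
<Connector port="80"
           protocol="org.apache.coyote.http11.Http11NioProtocol"
           maxThreads="200"
           maxConnections="10000"
           connectionTimeout="20000"
           redirectPort="443" />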
From what I've read, there's no benefit to using BIO so I don't know why it's the default. We were only using it because it was the default and we assumed the default settings were reasonable and we didn't want to become experts in tomcat tuning to the extent that we now have.
HOWEVER: Even after making this change, we had a similar occurrence: on the same day, HTTPS became unresponsive even while HTTP was working, and then a little later the opposite occurred. Which was a bit depressing. We checked in 'catalina.out' that in fact the NIO connector was being used, and it was. So we began a long period of analysing 'netstat' and wireshark. We noticed some periods of high spikes in the number of connections - in one case up to 900 connections when the baseline was around 70. These spikes occurred when we synchronised our databases between the main production server and the "appliances" we install at each customer site (schools). The more we did the synchronisation, the more we caused outages, which caused us to do even more synchronisations in a downward spiral.
What seems to be happening is that the NSW Education Department proxy server splits our database synchronisation traffic into multiple connections so that 1000 page requests become 1000 connections, and furthermore they are not closed properly until the TCP 4 minute timeout. The proxy server was only able to do this because we were using HTTP. The reason they do this is presumably load balancing - they thought by splitting the page requests across their 4 servers, they'd get better load balancing. When we switched to HTTPS, they are unable to do this and are forced to use just one connection. So that particular problem is eliminated - we no longer see a burst in the number of connections.
People have suggested increasing "maxThreads". In fact this would have improved things but this is not the 'proper' solution - we had the default of 200, but at any given time, hardly any of these were doing anything, in fact hardly any of these were even allocated to page requests.
I think you need to load test the application with Apache JMeter to check the number of connections, and use JConsole or Zabbix to look at heap space and thread dumps for the Tomcat server.
The NIO connector of Apache Tomcat can handle a maximum of 10,000 connections, but I don't think it's a good idea to give that many connections to one instance of Tomcat; a better way is to run multiple instances of Tomcat.
In my view, the best setup for a production server is to run an Apache HTTP server in front and point your Tomcat instances to it using the AJP connector.
Hope this helps.
Are you absolutely sure you're not hitting the maxThreads limit? Have you tried changing it?
These days browsers limit simultaneous connections to a max of 4 per hostname/ip, so if you have 50 simultaneous browsers, you could easily hit that limit. Although hopefully your webapp responds quickly enough to handle this. Long polling has become popular these days (until websockets are more prevalent), so you may have 200 long polls.
Another cause could be if you use HTTP[S] for app-to-app communication (that is, no browser involved). Sometimes app writers are sloppy and create new connections for performing multiple tasks in parallel, causing TCP and HTTP overhead. Double-check that you are not getting a flood of requests. Log files can usually help you with this, or you can use Wireshark to count the number of HTTP requests or HTTP[S] connections. If possible, modify your API to handle multiple API calls in one HTTP request.
Related to the last one: if you have many HTTP/1.1 requests going across one connection, an intermediate proxy may be splitting them into multiple connections for load-balancing purposes. Sounds crazy, I know, but I've seen it happen.
Lastly, some crawl bots ignore the crawl delay set in robots.txt. Again, log files and/or Wireshark can help you determine this.
Overall, run more experiments with more changes (maxThreads, HTTPS, etc.) before jumping to conclusions about maxConnections.

Using Fleck Websocket for 10k simultaneous connections

I'm implementing a websocket-secure (wss://) service for an online game where all users will be connected to the service as long as they are playing the game. This will use a high number of simultaneous connections, although traffic won't be a big problem, as the service is used for chat, storage and notifications, not for real-time data synchronization.
I wanted to use Alchemy-Websockets, but it doesn't support TLS (wss://), so I have to look for another library like Fleck (or something else).
Alchemy has been tested with a high number of simultaneous connections, but I didn't find similar tests for Fleck, so I need some real-world information from users of Fleck.
I know that Fleck is non-blocking and uses async calls, but I need some real info, because it might be abusing threads, the garbage collector, or some other aspect that isn't visible at lower numbers of connections.
I will use C# for the client as well, so I need neither hybi-XX compatibility nor fallback; I just need scalability and TLS support.
I finally added Mono support to WebSocketListener.
Check here how to run WebSocketListener in Mono.
10K connections is no small thing. WebSocketListener is asynchronous and it scales well. I have done tests with 10K connections and it should be fine.
My tests show that WebSocketListener is almost as fast and scalable as the Microsoft one, and it performs better than Fleck, Alchemy and others.
I ran a test on a Windows machine with a Core 2 Duo E8400 processor and 4 GB of RAM.
The results were not encouraging: it started delaying handshakes after it reached ~1000 connections, i.e. it would take about one minute to accept a new connection.
These results improved when I used XSockets, which reached 8000 simultaneous connections before the same thing happened.
I tried to test on a Linux VPS with Mono, but I don't have enough experience with Linux administration, and a few system settings related to TCP etc. needed to be changed in order to allow a high number of concurrent connections, so I could only reach ~1000 on the default settings; after that the app crashed (in both the Fleck test and the XSockets test).
On the other hand, I tested node.js, and it seemed simpler to manage a very high number of connections, as node didn't crash when it reached the TCP limits.
All the tests were echo tests: the server sends each message back to the client who sent it and to one random other connected client, and each connected client sends a random ~30-character text message to the server at a random interval between 0 and 30 seconds.
I know my tests are not generic enough, and I encourage anyone to run their own tests instead, but I just wanted to share my experience.
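In that spirit, here is a rough sketch of the kind of echo load-test client described above, written against Java's built-in WebSocket client (the original tests were not in Java; the server URL, connection count and message logic here are assumptions for illustration only):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.WebSocket;
import java.util.Random;
import java.util.concurrent.CompletionStage;

// Rough echo load-test client: each connection sends a random ~30-char
// message at a random interval between 0 and 30 seconds, as described above.
public class EchoLoadTest {
    public static void main(String[] args) {
        int connections = 1000;                          // sweep upward toward 10k
        URI server = URI.create("ws://localhost:8181/"); // placeholder server
        Random rnd = new Random();
        HttpClient http = HttpClient.newHttpClient();

        for (int i = 0; i < connections; i++) {
            WebSocket ws = http.newWebSocketBuilder()
                    .buildAsync(server, new WebSocket.Listener() {
                        @Override
                        public CompletionStage<?> onText(WebSocket w, CharSequence data, boolean last) {
                            w.request(1); // keep receiving echoes
                            return null;
                        }
                    }).join();
            new Thread(() -> {
                try {
                    while (true) {
                        Thread.sleep(rnd.nextInt(30_000)); // 0-30 s interval
                        ws.sendText(randomText(rnd, 30), true);
                    }
                } catch (InterruptedException ignored) { }
            }).start();
        }
    }

    static String randomText(Random rnd, int len) {
        StringBuilder sb = new StringBuilder(len);
        for (int i = 0; i < len; i++) sb.append((char) ('a' + rnd.nextInt(26)));
        return sb.toString();
    }
}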
When we decided to try Fleck, we implemented a wrapper for the Fleck server and a JavaScript client API so that we could send acknowledgment messages back to the server. We wanted to test the performance of the server - message delivery time, percentage of lost messages, etc. The results were pretty impressive for us, and we are currently using Fleck in our production environment.
We have 4000-5000 concurrent connections during peak hours. On average 40 messages are sent per second. The acknowledged-message ratio (acknowledged messages / total sent messages) never drops below 0.994. The average round trip for messages is around 150 milliseconds (the duration between the server sending the message and receiving its ack). Finally, we did not have any memory-related problems with the Fleck server, even after heavy usage.

Can bittorrent peers handle seeding large numbers of idle torrents

I'm considering using bittorrent for a large data-dissemination problem where the data source is petascale and users will want up to several terabytes. Some details:
Number of torrents potentially in the millions
Torrent sizes ranging from 100MB to 100GB
A stable set of clusters around the world capable of acting as seeders each holding a large subset of the total torrents (say 60% on average)
A relatively small number of simultaneous users (less than 100) wanting to download on average a few terabytes of data.
I expect the number of active torrents to be small compared to the total available but quality of service is important so there must be several seeders for each torrent or some mechanism for launching new seeders.
My question is can bittorrent clients handle seeding huge numbers of torrents, most of which are idle? Would I need to stripe torrents across the seeders in a cluster or could each node be seeding all torrents it has access to? Which client would do the best job? Are there any tools for managing clusters of seeders?
I am assuming that trackers can be made to scale to this level.
There are 2 main problems:
Each torrent (typically) needs to announce to a tracker periodically; this might end up using a significant amount of bandwidth.
The bittorrent client itself needs to be written in a way that scales to a large number of torrents.
As for the tracker traffic: let's assume you have 1 million torrents. The typical re-announce interval is 30 minutes, but some trackers set it to 1 hour. Let's be conservative and assume your tracker uses 1-hour announce intervals. You will have to make 1 million GET requests per hour; say each request is 400 bytes up and 100 bytes down (assuming most responses will not contain any peers). That's about 111 kB/s up and 28 kB/s down, constantly. That's not so bad, but keep in mind that TCP requires an extra round trip to establish each connection, so that's another 40 bytes down and 40 bytes up per announce.
This can be mitigated by using only UDP trackers. Then you would only need a single connect message, and you could reuse the connection ID for each announce. Each announce message would then be 100 bytes, and the returned message would be a bit more compact as well; let's assume 60 bytes. That would get you 28 kB/s up and 16 kB/s down, just to keep the torrents announced. For this you would need a client with decent UDP tracker support (one that caches the connection ID, for instance).
Not too bad, assuming that's insignificant compared to the actual data your seeds would send.
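If you want to re-run that arithmetic with your own message sizes, here's a quick sketch using the same figures assumed above:

// Re-derives the tracker-bandwidth estimates above: 1M torrents, 1-hour
// announce interval, with the per-message byte counts assumed in the text.
public class AnnounceBandwidth {
    public static void main(String[] args) {
        double torrents = 1_000_000;
        double intervalSeconds = 3600;

        // HTTP/TCP tracker: ~400 bytes up, ~100 bytes down per announce
        System.out.printf("TCP: %.1f kB/s up, %.1f kB/s down%n",
                torrents * 400 / intervalSeconds / 1000,
                torrents * 100 / intervalSeconds / 1000);

        // UDP tracker: ~100 bytes up, ~60 bytes down per announce
        System.out.printf("UDP: %.1f kB/s up, %.1f kB/s down%n",
                torrents * 100 / intervalSeconds / 1000,
                torrents * 60 / intervalSeconds / 1000);
    }
}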
However, you don't necessarily need to stripe your torrents across separate data centers; you could also use an HTTP server to seed the torrents. All major bittorrent clients support HTTP seeding, and you wouldn't have to worry about announcing to the tracker (the URL is burned into the .torrent itself).
As for a client that scales well with torrents, I don't know for sure, I haven't done any measurements. It should be fairly straightforward to just generate a million random torrents and try to load it up.
I have done some optimization work in libtorrent (rasterbar) to make it scale well with many torrents; I haven't tried millions, though.
I've written a blog post on this topic, here.
You may be looking for Hekate
It's at best pre-alpha right now, but it's quite nearly what you're describing.
To avoid collapsing under the overhead of useless tracker announces and scrapes in the millions (and that in every announce interval), you have to restrict your seeding clusters to loading only the current working set of items that are being requested right now. Downloaders need to get (download) the .torrent file from a central place anyway, and that could trigger loading it into the seeding clusters. Alternatively, determine activity for a particular info-hash by recognizing announces that do NOT originate from one of your seed clusters.
rTorrent has fast-resume (meaning no hashing happens when an appropriately prepared .torrent is loaded), and is controllable via xmlrpc so you can decommission idle items. That way, a .torrent download can trigger the actual data to be available for the next 24 hours, or as long as there's activity in the swarm.
The protocol allows for this, but I do not know which clients would scale to millions of torrents. In the worst case, you would have to write your own seed-only client.
The protocol feature most relevant to your use case is that, when a peer connects to another, the connecting peer is supposed to send the torrent's info-hash first. This means that a single listening TCP port can be used to seed an unlimited number of torrents, with almost zero resources used when idle.
This can be found in The BitTorrent Protocol Specification:
If both sides don't send the same value, they sever the connection. The one possible exception is if a downloader wants to do multiple downloads over a single port, they may wait for incoming connections to give a download hash first, and respond with the same one if it's in their list.
I also found the same in this BitTorrent Protocol Specification v1.0:
The initiator of a connection is expected to transmit their handshake immediately. The recipient may wait for the initiator's handshake, if it is capable of serving multiple torrents simultaneously (torrents are uniquely identified by their info_hash).
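To illustrate the mechanism, here is a minimal sketch of the accepting side. The handshake layout (length-prefixed protocol string, 8 reserved bytes, 20-byte info_hash, 20-byte peer_id) is from the spec quoted above; the torrent lookup map is a hypothetical stand-in for your own state.

import java.io.DataInputStream;
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.Map;

// Minimal sketch of a seed accepting many torrents on one port: read the
// incoming handshake, extract the 20-byte info_hash, and only respond if we
// actually seed that torrent (per the spec excerpts quoted above).
public class MultiTorrentAcceptor {
    private final Map<String, Object> seededTorrents; // hex info_hash -> torrent state (stand-in)

    MultiTorrentAcceptor(Map<String, Object> seededTorrents) {
        this.seededTorrents = seededTorrents;
    }

    void serve(int port) throws IOException {
        try (ServerSocket server = new ServerSocket(port)) {
            while (true) {
                Socket peer = server.accept();
                DataInputStream in = new DataInputStream(peer.getInputStream());
                int pstrlen = in.readUnsignedByte();   // 19 for "BitTorrent protocol"
                in.readFully(new byte[pstrlen + 8]);   // skip protocol string + reserved bytes
                byte[] infoHash = new byte[20];
                in.readFully(infoHash);                // identifies which torrent the peer wants
                if (seededTorrents.containsKey(hex(infoHash))) {
                    // respond with our own handshake for the same info_hash...
                } else {
                    peer.close();                      // not one of ours: sever the connection
                }
            }
        }
    }

    private static String hex(byte[] b) {
        StringBuilder sb = new StringBuilder();
        for (byte x : b) sb.append(String.format("%02x", x));
        return sb.toString();
    }
}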
However, there is one thing that would increase your load, and it is the tracker. With the normal tracker protocol, each client has to periodically announce to the tracker each torrent it has, together with information like how much it has uploaded. With millions of torrents, this would present a somewhat high load. If you were writing your own mass-seed-only client, a separate protocol to announce your seeders to the tracker would be a good idea.
