Concurrent networking in Scala - linux

I have a working prototype of a concurrent Scala program using Actors. I am now trying to fine tune the number of different Actors, etc..
One stage of the processing requires fetching new data via the internet. Of course, there is nothing I can really do to speed that aspect up. However, I figure if I launch a bunch of requests in parallel, I can bring down the total time. The question, therefore, is:
=> Is there a limit on concurrent networking in Scala or on Unix systems (such as max num sockets)? If so, how can I find out what it is.

In Linux, there is a limit on the number of open file descriptors each program can have open. This can be seen using the ulimit -n. There is a system-wide limit in /proc/sys/kernel/file-max.
Another limit is the number of connections that the Linux firewall can track. If you are using the iptables connection tracking firewall this value is in /proc/sys/net/netfilter/nf_conntrack_max.
Another limit is of course TCP/IP itself. You can only have 65534 connections to the same remote host and port because each connection needs a unique combination of (localIP, localPort, remoteIP, remotePort).
Regarding speeding things up via concurrent connections: it isn't as easy as just using more connections.
It depends on where the bottlenecks are. If your local connection is being fully used, adding more connections will only slow things down. If you are connecting to the same remote server and its connection is fully used, more will only slow it down.
Where you can get a benefit is when your local connection is not fully used and you are connecting to multiple remote hosts.
If you look at web browsers, you will see they have limits on how many connections will be made to the same remote server. They also have limits on how many connections will be made in total.

Related

In node.js, why one would want to use pools when connecting through node-postgres?

With node-postgres npm package, I'm given two connection options: with using Client or with using Pool.
What would be the benefit of using a Pool instead of a Client, what problem will it solve for me in the context of using node.js, which is a) async, and b) won't die and disconnect from Postgres after every HTTP request (as PHP would do, for example).
What would be the technicalities of using a single instance of Client vs using a Pool from within a single container running a node.js server? (e.g. Next.js, or Express, or whatever).
My understanding is that with server-side languages like PHP (classic sync php), Pool would benefit me by saving time on multiple re-connections. But a Node.js server connects once and maintains an open connection to Postgres, so why would I want to use a Pool?
PostgreSQL's architecture is specifically built for pooling. Its developers decided that forking a process for each connection to the database was the safest choice and this hasn't been changed since the start.
Modern middleware that sits between the client and the database (in your case node-postgres) opens and closes virtual connections while administering the "physical" connection to the Postgres database can be held as efficient as possible.
This means connection time can be reduced a lot, as closed connections are not really closed, but returned to a pool, and opening a new connection returns the same physical connection back to the pool after use, reducing the actual forking going on the database side.
Node-postgres themselves write about the pros on their website, and they recommend you always use pooling:
Connecting a new client to the PostgreSQL server requires a handshake
which can take 20-30 milliseconds. During this time passwords are
negotiated, SSL may be established, and configuration information is
shared with the client & server. Incurring this cost every time we
want to execute a query would substantially slow down our application.
The PostgreSQL server can only handle a limited number of clients at a
time. Depending on the available memory of your PostgreSQL server you
may even crash the server if you connect an unbounded number of
clients. note: I have crashed a large production PostgreSQL server
instance in RDS by opening new clients and never disconnecting them in
a python application long ago. It was not fun.
PostgreSQL can only process one query at a time on a single connected
client in a first-in first-out manner. If your multi-tenant web
application is using only a single connected client all queries among
all simultaneous requests will be pipelined and executed serially, one
after the other. No good!
https://node-postgres.com/features/pooling
I think it was clearly expressed in this snippet.
"But a Node.js server connects once and maintains an open connection to Postgres, so why would I want to use a Pool?"
Yes, but the number of simultaneous connections to the database itself is limited, and when too many browsers try to connect at the same time, the database's handling of it is not elegant. A pool can better mitigate this by virtualizing and outsourcing from the database itself the queuing and error handling that no databases are specialized in.
"What exactly is not elegant and how is it more elegant with pooling?"
A database stops responding, a connection times out, without any feedback to the end user (and even often with few clues to the server admin). The database is dependent on hardware to a higher extent than a javascript program. The risk of failure is higher. Those are my main "not elegant" arguments.
Pooling is better because:
a) As node-postgres wrote in my link above: "Incurring the cost of a db handshake every time we want to execute a query would substantially slow down our application."
b) Postgres can only process one query at a time on a single connected client (which is what Node would do without the pool) in a first-in first-out manner. All queries among all simultaneous requests will be pipelined and executed serially, one after the other. Recipe for disaster.
c) A node-based pooling component is in my opinion a better interface for enhancements, like request queuing, logging and error handling compared to a single-threaded connection.
Background:
According to Postgres themselves pooling IS needed, but deliberately not built into Postgres itself. They write:
"If you look at any graph of PostgreSQL performance with number of connections on the x axis and tps on the y access (with nothing else changing), you will see performance climb as connections rise until you hit saturation, and then you have a "knee" after which performance falls off. A lot of work has been done for version 9.2 to push that knee to the right and make the fall-off more gradual, but the issue is intrinsic -- without a built-in connection pool or at least an admission control policy, the knee and subsequent performance degradation will always be there.
The decision not to include a connection pooler inside the PostgreSQL server itself has been taken deliberately and with good reason:
In many cases you will get better performance if the connection pooler is running on a separate machine;
There is no single "right" pooling design for all needs, and having pooling outside the core server maintains flexibility;
You can get improved functionality by incorporating a connection pool into client-side software; and finally
Some client side software (like Java EE / JPA / Hibernate) always pools connections, so built-in pooling in PostgreSQL would then be wasteful duplication.
Many frameworks do the pooling in a process running on the the database server machine (to minimize latency effects from the database protocol) and accept high-level requests to run a certain function with a given set of parameters, with the entire function running as a single database transaction. This ensures that network latency or connection failures can't cause a transaction to hang while waiting for something from the network, and provides a simple way to retry any database transaction which rolls back with a serialization failure (SQLSTATE 40001 or 40P01).
Since a pooler built in to the database engine would be inferior (for the above reasons), the community has decided not to go that route."
And continue with their top reasons for performance failure with many connections to Postgres:
Disk contention. If you need to go to disk for random access (ie your data isn't cached in RAM), a large number of connections can tend to force more tables and indexes to be accessed at the same time, causing heavier seeking all over the disk. Seeking on rotating disks is massively slower than sequential access so the resulting "thrashing" can slow systems that use traditional hard drives down a lot.
RAM usage. The work_mem setting can have a big impact on performance. If it is too small, hash tables and sorts spill to disk, bitmap heap scans become "lossy", requiring more work on each page access, etc. So you want it to be big. But work_mem RAM can be allocated for each node of a query on each connection, all at the same time. So a big work_mem with a large number of connections can cause a lot of the OS cache to be periodically discarded, forcing more accesses to disk; or it could even put the system into swapping. So the more connections you have, the more you need to make a choice between slow plans and trashing cache/swapping.
Lock contention. This happens at various levels: spinlocks, LW locks, and all the locks that show up in pg_locks. As more processes compete for the spinlocks (which protect LW locks acquisition and release, which in turn protect the heavyweight and predicate lock acquisition and release) they account for a high percentage of CPU time used.
Context switches. The processor is interrupted from working on one query and has to switch to another, which involves saving state and restoring state. While the core is busy swapping states it is not doing any useful work on any query. Context switches are much cheaper than they used to be with modern CPUs and system call interfaces but are still far from free.
Cache line contention. One query is likely to be working on a particular area of RAM, and the query taking its place is likely to be working on a different area; causing data cached on the CPU chip to be discarded, only to need to be reloaded to continue the other query. Besides that the various processes will be grabbing control of cache lines from each other, causing stalls. (Humorous note, in one oprofile run of a heavily contended load, 10% of CPU time was attributed to a 1-byte noop; analysis showed that it was because it needed to wait on a cache line for the following machine code operation.)
General scaling. Some internal structures allocated based on max_connections scale at O(N^2) or O(N*log(N)). Some types of overhead which are negligible at a lower number of connections can become significant with a large number of connections.
Source

Deal with a large number of outgoing TCP connections in linux

I'm building a service that periodically connects to a large number of devices (thousands) via TCP sockets. These connections are all established at the same time, a couple of api commands are sent through each, then the sockets are closed. All devices are in the same subnet.
The trouble starts after about 1000 devices. New connections are not established. After waiting for a couple of minutes, everything is back to normal. My first guess was that the maximum number of socket connections had been reached and after reading through many similar questions and tutorials, I modified some kernel networking parameters like various cache sizes, max number of open files, tcp_tw_reuse, somaxconn. Unfortunately, it had little to no effect.
The problem does not seem to be burst-related: The first time I run the script, it works fine, but when I start it again a couple of minutes later, then I start seeing these errors. My best guess was that the number of open sockets builds up over time, possibly in the TIME_WAIT state. On the other hand, setting the tcp_tw_reuse parameter (which seems perfect for this scenario) did not have any noticeable effect. I close the sockets via pythons socket.close().
It may be important to stress that this question is not about a high-load server, but a high-load client! The connections are outgoing. I saw many server-related questions that were answered with the solutions I described above.

Socket.io: How many concurrent connections can WebSockets handle?

I am wondering if you have any data on concurrent connections to websockets? I am using Socket.io on Node.js server. How many clients can connect to socket and receive data without bringing my server down? 1000? 1000.0000?
Thanks!
This highly depends on your hardware configuration, what exactly are you doing/processing on the server side and if your system is optimized for many concurrent connections. For example on Linux machine by default you would probably first hit maximum number of opened files or other limits (which can be increased) before running into hardware resources exhaustion or similar scalability issues. Key resource may be the amount of RAM which can be allocated by your node.js program to keep concurrent connections opened and ability to receive new ones.
http://blog.caustik.com/2012/08/19/node-js-w1m-concurrent-connections/
Check this blog. We use the same principle. Previously our nodejs server will crash after 100 concurrent connections due to hardware constraints. But after we move to Amazon EC2 it is now highly scalable.

Comet and node.js - how many simultaneous connections could we expect on an EC2 server?

With a comet server running on node.js - how many simultaneous connections could we expect to get out of an EC2 server?
Anyone done this before and found a reasonable limit?
Our particular application only needs to push data to the clients fairly infrequently, it's more the max simultaneous connections per server that is a worry for us. We're looking at somewhere between 200k - 500k i think, and i'm trying to figure out if comet is going to be workable without a monstrous fleet of servers...
If you are running Linux, get to know the contents of /proc/sys/net/ipv4
In particular, net.ipv4.netfilter.ip_conntrack_max will let you increase the maximum number of open connections, but when you start plugging in really big numbers you will run into other problems. For instance you might need to reduce orphan_retries because you will statistically be more likely to have orphans. And with really big numbers, it is entirely possible that kernel lookup algorithms will slow down significantly. You need to carefully tune the TCP settings.
If I were in your shoes, I would compare at least two OSes, such as Linux and FreeBSD or OpenSolaris/Illumos.
On FreeBSD you will need to change settings in /boot/loader.conf
On OpenSolaris/Illumos you will need to read the documentation for the ndd command.

Can I Open Multiple Connections to a HTTP Server?

I’m writing a small software component in order to download resources from a web Server (IIS).
But it seems like that system's performance is not acceptable. Now I’m planning to increase the number of connection to the web server by spawning multiple threads.
My question is, can I improve performance by using multiple threads? More over dose web server allow me to spawning several simultaneous connections?
Thanks
Upul
All properly configured web servers should be able to handle multiple connections from the same source. This allows, for example, a browser to download two images from a page at once.
Some servers may place an upper limit on the number of concurrent connections it will accept from one client, but this will usually be a high number. Using up to 6 connections is normally safe.
As for whether it will actually improve performance, that depends on your situation. If you have a very fast connection to an internet backbone, and find that the speed you are getting from the remote server is not taking advantage of the speed of your connection, then in many situations multithreading can improve speed. If the speed is already maxing out the speed of your connection to the internet, or the connection of the remote server, then it can't do anything.
Web servers are do allow simultaneous connections. There should not be any problem in opening one's unless the application logic prevents opening multiple connections for same client. To further clarify my point, if you require login before downloading resources and your application does not allow multiple simultaneous logins, then you will get stuck there. There are applications which do not allow lot of connections from same source for security reasons.

Resources