struggling with connection idle timeout settings on ubuntu with postgresql - linux

I'm a beginner at Linux server configuration and I don't have much knowledge about it. I use an Ubuntu Linux root server for a website with a PostgreSQL database. The operating system on my PC is Windows 7.
After a few minutes (I'm not quite sure how long it takes, maybe 5 minutes or so, not a lot) without doing anything I lose my connection, which is really annoying. I googled how to fix it, but didn't really find a solution, or didn't understand the ones I found.
For example, I tried to update my postgresql.conf and edited these values:
#tcp_keepalives_idle
#tcp_keepalives_interval
#tcp_keepalives_count
which didn't really help. I want to be able to idle for 30 minutes without losing the connection.
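(For reference: in postgresql.conf a leading # marks a comment, so edited values only take effect once the # is removed and PostgreSQL is reloaded. Uncommented, the lines might look roughly like this - the numbers are only an example, chosen so probes are sent well inside a short NAT idle timeout:)
tcp_keepalives_idle = 120
tcp_keepalives_interval = 30
tcp_keepalives_count = 5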
Then I read another solution:
http://www.gnugk.org/keepalive.html
I honestly didn't really understand what the lines I'm supposed to add are for.
Because when I check this:
sysctl -A | grep net.ipv4
it shows me:
net.ipv4.tcp_keepalive_intvl = 75
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_time = 7200
which should mean that I won't lose my connection for 2 hours, shouldn't it?
I also don't really understand what those lines are for... Do they mean that a client connected to any service will stay connected for 2 hours, even while inactive? No matter whether the service is, for example, PostgreSQL or FTP or something else?
Please help me!
Thanks!
André

Okay, it seems that I solved the problem. Although there is no answer here, I just want to explain my solution.
My ISP seems to drop my connection very quickly when it idles for just a few minutes. It seems to be a problem with CGN (carrier-grade NAT).
I solved the problem by setting up keepalive packets with sysctl.
So I used these parameter values:
net.ipv4.tcp_keepalive_intvl = 60
net.ipv4.tcp_keepalive_probes = 20
net.ipv4.tcp_keepalive_time = 180
which means that the first keepalive packet is sent after 3 minutes of idle time, and if no reply comes back, a new keepalive packet is sent every minute (60 sec), up to 20 times.
All in all, that keeps my connection from being dropped.
Maybe if someone else is having this issue too, this might be a solution for it.
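For anyone copying this: a quick sketch of how those values can be applied and persisted on a stock Ubuntu (the numbers are the ones explained above; note that the kernel only sends keepalive probes on sockets whose application enables SO_KEEPALIVE, which PostgreSQL and OpenSSH normally do by default):
# apply at runtime
sudo sysctl -w net.ipv4.tcp_keepalive_time=180
sudo sysctl -w net.ipv4.tcp_keepalive_intvl=60
sudo sysctl -w net.ipv4.tcp_keepalive_probes=20
# persist across reboots
echo "net.ipv4.tcp_keepalive_time = 180" | sudo tee -a /etc/sysctl.conf
echo "net.ipv4.tcp_keepalive_intvl = 60" | sudo tee -a /etc/sysctl.conf
echo "net.ipv4.tcp_keepalive_probes = 20" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p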

Related

CentOS 8 dbus-daemon stops selinux enforcing for a few minutes then returns to enforcing

I am having a problem I have spent multiple days searching for an answer to. I administer a system running CentOS 8 (yes, I know, move to another distribution - we are doing so within the month). The problem is that if I run "ausearch -m AVC,USER_AVC,SELINUX_ERR -ts today", I find that the system's dbus-daemon does this: msg='avc: received setenforce notice (enforcing=0)' exe="/usr/bin/dbus-daemon". A few minutes later (about two and a half minutes), the daemon does this: msg='avc: received setenforce notice (enforcing=1)' exe="/usr/bin/dbus-daemon".
I cannot find any references on the net to any daemon that does this, and it concerns me that it may be the kind of security failure our company does its best to eliminate. Can anyone enlighten me as to what is happening?

Benchmarking Node.JS server

I've written a Node.JS server which I would like to benchmark. It has the following components that I would like to benchmark separately:
- socket.io: how many continuous connections can it accept and process (where is the saturation point)
- redis: the same as above
- express: don't want to benchmark it
I know there is some (though not a lot of) documentation about this on the internet, but I don't want to reinvent the wheel, and I don't want to spend countless hours trying a solution that turns out to be wrong for the job.
This is why I'm asking here: what should I use to get a number/graph (whatever) of how many simultaneous connections the server can handle without getting bogged down? It would also be nice to monitor CPU, memory and swap of the process (yes, I know I could use countless techniques or write my own script, but maybe something like that already exists).
I'm not looking for an answer where you just paste a link to some solution I already know exists; I'd like an answer from someone with actual experience who can make a point or two and point me in the right direction.
Thank you
You can use ApacheBench (ab) to test the load your server can take - man page
Some nice tutorials:
nixcraft/howto-performance-benchmarks-a-web-server
petefreitag/Using Apache Bench for Simple Load Testing
Usage:
$ ab -k -n 1000 -c 100 http://www.yourserver.com/
-k - use HTTP keep-alive
-n N - send N requests in total
-c C - make C requests concurrently
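For the CPU/memory/swap part of the question, one rough option is to sample the Node process while ab runs, e.g. with pidstat from the sysstat package (a sketch; the pgrep pattern is just an assumption about how the server was started):
# per-process CPU (-u) and memory (-r), sampled every second
pidstat -u -r -p "$(pgrep -f 'node server.js')" 1
# overall memory and swap
free -m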

Node.js struggling with lots of concurrent connections

I'm working on a somewhat unusual application where 10k clients are precisely timed to all try to submit data at once, every 3 mins or so. This 'ab' command fairly accurately simulates one barrage in the real world:
ab -c 10000 -n 10000 -r "http://example.com/submit?data=foo"
I'm using Node.js on Ubuntu 12.04 on a Rackspace Cloud VPS instance to collect these submissions; however, I'm seeing some very odd behavior from Node, even when I remove all my business logic and turn the HTTP request handler into a no-op.
When the test gets about 90% done, it hangs for a long period of time. Strangely, this happens consistently at 90% - for c=n=10k, at 9000; for c=n=5k, at 4500; for c=n=2k, at 1800. The test actually completes eventually, often with no errors. But both ab and node logs show continuous processing up till around 80-90% of the test run, then a long pause before completing.
When node is processing requests normally, CPU usage is typically around 50-70%. During the hang period, CPU goes up to 100%. Sometimes it stays near 0. Between the erratic CPU response and the fact that it seems unrelated to the actual number of connections (only the % complete), I do not suspect the garbage collector.
I've tried running 'ab' both on localhost and from a remote server - same effect.
I suspect something related to the TCP stack, possibly involving closing connections, but none of my configuration changes have helped. My changes:
ulimit -n 999999
When I listen(), I set the backlog to 10000
Sysctl changes are:
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.ipv4.tcp_max_orphans = 20000
net.ipv4.tcp_max_syn_backlog = 10000
net.core.somaxconn = 10000
net.core.netdev_max_backlog = 10000
I have also noticed that I tend to get this message in the kernel logs:
TCP: Possible SYN flooding on port 80. Sending cookies. Check SNMP counters.
I'm puzzled by this message, since the TCP backlog queue should be deep enough to never overflow. If I disable SYN cookies, "Sending cookies" changes to "Dropping connections".
I speculate that this is some sort of linux TCP stack tuning problem and I've read just about everything I could find on the net. Nothing I have tried seems to matter. Any advice?
Update: Tried with tcp_max_syn_backlog, somaxconn, netdev_max_backlog, and the listen() backlog param set to 50k with no change in behavior. Still produces the SYN flood warning, too.
Are you running ab on the same machine that is running node? If not, do you have a 1G or 10G NIC? If you are, then aren't you really trying to process 20,000 open connections?
Also, if you are setting net.core.somaxconn to 10,000, do you have absolutely no other sockets open on that machine? If you do, then 10,000 is not high enough.
Have you tried using the Node.js cluster module to spread the open connections across several processes?
I think you might find this blog post, and also the previous ones, useful:
http://blog.caustik.com/2012/08/19/node-js-w1m-concurrent-connections/
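If it helps to narrow things down, the listen queue itself can be watched on the server while the test runs (a sketch; port 80 and the exact counter wording vary by setup and kernel):
# for a listening socket, Recv-Q is the current accept-queue depth and Send-Q its configured maximum
ss -ltn | grep ':80 '
# cumulative overflow/drop counters, e.g. "times the listen queue of a socket overflowed"
netstat -s | grep -i -E 'listen|SYNs'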

Can't open more than 1023 sockets

I'm developing some code that is simulating network equipment. I need to run several thousand simulated "agents", and each needs to connect to a service. The problem is that after opening 1023 connections, the connects start to time out, and the whole thing comes crashing down.
The main code is in Go, but I've written a very trivial python script which reproduces the problem.
The one thing that is somewhat unusual is that we need to set the local address on the socket when we create it. This is because the equipment that the agents are connecting to expects the apparent IP to match what we say it should be. To achieve this, I have configured 10,000 virtual interfaces (eth0:1 to eth0:10000). These are assigned unique IP addresses in a private network.
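(A sketch of how aliases like that might be created - the address plan follows the question, but the netmask, loop bounds and use of the old ifconfig alias syntax are assumptions:)
i=1
for b in $(seq 10 29); do
  for d in $(seq 1 99); do
    sudo ifconfig "eth0:$i" "1.$b.1.$d" netmask 255.255.0.0 up
    i=$((i+1))
  done
done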
The Python script is just this (it only runs to about 2000 connects):
import socket

i = 0
for b in range(10, 30):
    for d in range(1, 100):
        i += 1
        ip = "1.%d.1.%d" % (b, d)
        print("Conn %i %s" % (i, ip))
        s = socket.create_connection(("1.6.1.1", 5060), 10, (ip, 5060))
If I remove the last argument to socket.create_connection (the source address), then I can get all 2000 connections.
The thing that is different with using a local address is that a bind must be made before the connection can be set up, so the output from this program running under strace looks like this:
Conn 1023 1.20.1.33
bind(3, {sa_family=AF_NETLINK, pid=0, groups=00000000}, 12) = 0
bind(3, {sa_family=AF_INET, sin_port=htons(5060), sin_addr=inet_addr("1.20.1.33")}, 16) = 0
connect(3, {sa_family=AF_INET, sin_port=htons(5060), sin_addr=inet_addr("1.6.1.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
If I run without a local address, the AF_INET bind goes away, and it works.
So, it seems there must be some kind of limit on the number of binds that can be made. I've waded through all sorts of links about TCP tuning on Linux, and I've tried messing with tcp_tw_reuse/recycle and I've reduced the fin_timeout, and I've done other things that I can't remember.
This is running on Ubuntu Linux 11.04, kernel 2.6.38 (64-bit). It's a virtual machine on a VMware ESX cluster.
Just before posting this, I tried running a second instance of the Python script, with the addresses starting at 1.30.1.1. The first script plowed through to 1023 connections, but the second one couldn't even complete its first connection, indicating that the problem is related to the large number of virtual interfaces. Could some internal data structure be limited? Some max memory setting somewhere?
Can anyone think of some limit in Linux that would cause this?
Update:
This morning I decided to try an experiment. I modified the python script to use the "main" interface IP as the source IP, and ephemeral ports in the range 10000+. The script now looks like this:
import socket

i = 0
for i in range(1, 2000):
    print("Conn %i" % i)
    s = socket.create_connection(("1.6.1.1", 5060), 10, ("1.1.1.30", i + 10000))
This script works just fine, so this adds to my belief that the problem is related to the large number of aliased IP addresses.
What a DOH moment. I was watching the server using netstat, and since I didn't see a large number of connections I didn't think there was a problem. But finally I wised up and checked the kernel log, in which I found this:
Mar 8 11:03:52 TestServer01 kernel: ipv4: Neighbour table overflow.
This led me to this posting: http://www.serveradminblog.com/2011/02/neighbour-table-overflow-sysctl-conf-tunning/ which explains how to increase the limit. Bumping the gc_thresh3 value immediately solved the problem.
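For reference, the knobs that posting describes are the neighbour (ARP) cache thresholds; something like this, with example values that should comfortably exceed the number of addresses involved:
sudo sysctl -w net.ipv4.neigh.default.gc_thresh1=4096
sudo sysctl -w net.ipv4.neigh.default.gc_thresh2=8192
sudo sysctl -w net.ipv4.neigh.default.gc_thresh3=16384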
You may want to look at the sysctl settings related to net.ipv4.
These include things like the connection-tracking maximum (nf_conntrack_max) and other relevant settings you may wish to tweak.
Are you absolutely certain that the issue is not on the server side, with the server not closing its sockets? I.e. what does lsof -n -p of the server process show? What does plimit -p of the server process show? The server side could be tied up, unable to accept any more connections, while the client side is getting the EINPROGRESS result.
Check the ulimit for the number of open files on both sides of the connection - 1024 is too close to a ulimit level to be a coincidence.
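A quick way to check both (a sketch; <pid> is a placeholder for the actual server or client process id):
# limit for the current shell (the side you start the script from)
ulimit -n
# limit and actual usage of an already-running process
grep 'open files' /proc/<pid>/limits
ls /proc/<pid>/fd | wc -l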

Ideal timeout period for dns lookup

In my Rails app I do a DNS lookup using the Ruby library Resolv. If a site like dgdfgdfgdfg.com is entered, it takes too long to resolve - in some instances around 20 seconds (mostly for non-existent sites) - which causes the application to slow down.
So I thought of introducing a timeout period for the DNS lookup.
What would be the ideal timeout period for the DNS lookup so that resolution of real sites doesn't fail? Would something like 10 seconds be fine?
There's no IETF mandated value, although §6.1.3.3 of RFC 1123 suggests a value not less than 5 seconds.
Perl's Net::DNS and the command-line dig utility both default to 5 seconds between retries. Some versions of the Microsoft resolver appear to default to 3 seconds.
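If you want to see what a given budget feels like in practice, dig lets you set the timeout and retry count per query (the 5 seconds / 2 tries here are just example values):
time dig +time=5 +tries=2 dgdfgdfgdfg.com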
You can run some tests among your users to find the right number, trading responsiveness against reliability.
You can also adjust that timeout dynamically depending on network conditions.
For example, for every successful resolution you record how long it took. Every hour (say) you calculate the average and set double that value as the timeout (remember that the average is, roughly speaking, "the middle"). This way, if your latency is high at some point, the timeout adjusts itself upwards automatically.
