Established Linux processes - linux

We are running an IBM MDM server (Initiate) which connects to an Oracle DB server through a connection pool. The pool size is set to 32. We also have a custom Java process that submits data to the MDM server through an API that the MDM server exposes. Once our custom Java process (which does not open any DB connections directly) terminates, we see that the number of processes between the MDM server and the DB server has risen above 32. After each nightly run the number keeps increasing, until it finally reaches the limit set on the Oracle DB (700); at that point the DB won't allow any more connections and our process fails that night. We are trying to figure out why the processes aren't being terminated and why they are still in the ESTABLISHED state (according to netstat).

There are several reasons why the number of processes could keep increasing while their sockets stay in the ESTABLISHED state.
A typical mistake is spawning a child process for each message/connect/register instead of reusing the existing connection, especially when timer callbacks are involved.
e.g.,
c - register for timer callback -> server
c -> spawn a process to receive the reply and listen on receive socket
c - register for timer callback -> server -> server
c -> spawn a process to receive the reply and listen on receive socket
instead it should be
c - register for timer callback -> server
c -> spawn a process to receive the reply and listen on receive socket
c - set the initialized flag
c - register for timer callback -> server
c -> if initialized do not spawn a process to receive the reply
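
The corrected sequence above can be sketched in Python roughly as follows. This is only an illustration under assumptions, not the poster's code: TimerClient, register_timer_callback and receivers_started are hypothetical names, and a thread stands in for the spawned receiver process.

```python
import threading


class TimerClient:
    """Hypothetical client; every name here is illustrative."""

    def __init__(self):
        self._lock = threading.Lock()
        self._initialized = False
        self._receiver = None
        self.receivers_started = 0

    def _receive_loop(self):
        # Placeholder: a real receiver would block on the reply socket here.
        pass

    def register_timer_callback(self):
        # Sending the registration to the server is omitted in this sketch.
        with self._lock:
            if self._initialized:
                return  # a receiver already exists, so reuse it
            self._initialized = True
            self.receivers_started += 1
            self._receiver = threading.Thread(target=self._receive_loop)
            self._receiver.start()

    def close(self):
        # Wait for the single receiver to finish before shutting down.
        if self._receiver is not None:
            self._receiver.join()


client = TimerClient()
for _ in range(5):
    client.register_timer_callback()
client.close()
print(client.receivers_started)  # 1
```

The lock matters because timer callbacks may fire concurrently; without it, two registrations could both see the flag unset and each start a receiver.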

Does the system throw any exception after reaching the maximum limit?
Are the created processes still active?
Are these processes establishing DB connections but not getting terminated?
Does the top output show the active processes?
1) Clear the old logs.
2) lsof data. This is an operating system command that will tell us
what descriptors are being used by the app server process.
lsof -p PID > lsof.out
3) ulimits. These are the operating system resource limits.
ulimit -a > ulimits.out
Please check that the code which opens a connection also closes it once it is done with it.
Check the lsof output for the type and status of each connection.
I work for IBM as a Java Service Engineer. Please answer the questions above so that we can help you better.

Related

Creating a socket inside a thread spawned by a process

I am using python 3.8 and I am trying to connect to an mqtt broker. This connection follows the path below:
Client (spawned with multiprocessing) -> thread (spawned by the client) -> thread tries to connect
I see the threads getting stuck in the socket create_connection function when the socket is created. Curiously enough, if I turn things around in this way:
Client (spawned with multithreading) -> process (spawned by the client) -> process tries to connect
it works. Is there any reason why in the first case threads can't create threads which will connect to the server? I can't really debug this as all the exceptions are swallowed by the process.
Thanks
It turned out that the process driving the threads and the threads themselves were all daemons. For some strange reason, even if you start the process and the threads and put a sleep after the process, the threads won't connect to the server/broker even though they run. The solution is to not declare the threads as daemons and to join them; then they are able to connect to the server. The error wasn't clear at all: I would have expected the threads not to run up to that point, or at least some clear indication of what was happening.
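
A minimal sketch of that fix, assuming a local listening socket as a stand-in for the broker (the original used an MQTT client, which is not reproduced here). The helper name connect_once and the results list are illustrative only.

```python
import socket
import threading


def connect_once(address, results, index):
    # Each worker opens its own connection and records whether it succeeded.
    try:
        with socket.create_connection(address, timeout=5):
            results[index] = True
    except OSError:
        results[index] = False


# Stand-in for the broker: a listening socket on an ephemeral local port.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(8)
address = server.getsockname()

results = [None] * 3
workers = [
    threading.Thread(target=connect_once, args=(address, results, i), daemon=False)
    for i in range(3)
]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()  # wait for every thread instead of letting the parent exit
server.close()
print(results)
```

Because the threads are not daemons and are joined, the parent cannot exit while a connection attempt is still in progress, and any failure shows up in results rather than disappearing with the process.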

Python3.5.x deadlock detected while testing connectivity to the internet

A python3.5 application occasionally deadlocks when run.
To give more details about the application, the main thread (thread A) is responsible for receiving data from an external device and sending it to an MQTT broker.
There are two other threads that are spawned: one for checking Internet connectivity (thread B, explained in detail below) and one that runs a class implementing a watchdog to check for possible deadlocks.
Thread A is long-running and sends data to the broker as received. In the case of no Internet connectivity, it starts storing data to a local database.
Thread B is used to check for Internet connectivity every 3 minutes, so that when connectivity is restored, the application can start sending the locally stored data back to the MQTT broker. This is to accommodate the scenario where the application loses Internet connectivity and starts losing data received from the device. To avoid this, the application, when offline, will start storing data locally to a SQLite3 database.
This application is run as a systemd service and internet connectivity is through the WiFi dongle, attached to the system.
Recently, we encountered a case where there was no internet connectivity (all pings were getting routed to the dongle’s IP address), and when the application tried to connect to the MQTT broker, it went into deadlock and the stack trace showed that this happened at the getaddrinfo function in socket.py.
Thread B was created to check for a successful internet connection before trying to connect to the MQTT client (to avoid deadlock).
This thread also checks on connectivity later on, when the Internet goes down, while the application is already up and running.
In this case, occasionally, the application runs into a deadlock between the main thread (thread A) and thread B.
The code for thread B is shown below:
while isGoing:
    try:
        host = socket.gethostbyname("www.google.com")
        ip = IP(host)
        if ip.iptype() == 'PRIVATE':
            disconnected = True
        else:
            disconnected = False
    except Exception as e:
        print(e)
        disconnected = True
    sleep(delay)
When thread B was monitored, it was seen to hang when using the subprocess module, os.system commands, as well as the gethostbyname function.
Note: Paho MQTT on_connect and on_disconnect callbacks are already used to check for connectivity.
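
One way to keep a connectivity check like thread B from blocking forever is to bound the lookup with a timeout. The sketch below is an assumption-laden illustration, not a confirmed fix for the deadlock described: resolve_with_timeout and is_disconnected are hypothetical helpers, the standard-library ipaddress module replaces the IP class used above, and the resolver parameter exists only so the function can be exercised without a network.

```python
import ipaddress
import queue
import socket
import threading


def resolve_with_timeout(hostname, timeout_s=5.0, resolver=socket.gethostbyname):
    """Return the resolved address, or None on failure or timeout."""
    outcome = queue.Queue(maxsize=1)

    def worker():
        try:
            outcome.put(resolver(hostname))
        except OSError:
            outcome.put(None)  # resolution errors count as "no address"

    # Daemon thread: a lookup that never returns cannot block interpreter exit.
    threading.Thread(target=worker, daemon=True).start()
    try:
        return outcome.get(timeout=timeout_s)
    except queue.Empty:
        return None  # the lookup is still blocked; stop waiting for it


def is_disconnected(hostname, timeout_s=5.0, resolver=socket.gethostbyname):
    address = resolve_with_timeout(hostname, timeout_s, resolver)
    if address is None:
        return True
    # A private address suggests the dongle answered instead of real DNS.
    return ipaddress.ip_address(address).is_private
```

Inside the loop this would be called as disconnected = is_disconnected("www.google.com"). Note the trade-off: a lookup that never returns leaves its helper thread behind, so this limits how long the check waits but does not remove the underlying hang.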

When does a socket timeout happen? (Unix)

When I connect to a Unix named socket, under which conditions may I receive ETIMEDOUT?
If it happens when the server does not accept() within N seconds, what is a typical N on Linux?
It happens if the server's operating system doesn't accept the connection within N seconds. The server application calling accept() is not normally relevant, because the operating system performs the 3-way handshake automatically, regardless of whether the application calls accept(); the TCP stack queues up the pending connections until the application does this (up to a backlog limit).
So normally this timeout only occurs if the server is physically down or there's a communication error on the network.
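On Linux the limit is not a fixed number of seconds but follows from the net.ipv4.tcp_syn_retries setting; with the usual default of 6 retries, an unanswered connect() gives up after roughly two minutes.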
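
The point about accept() can be checked with a short Python sketch. It assumes a TCP socket on the loopback interface, as in the answer, and try_connect is a hypothetical helper; the server below listens but never calls accept(), and the connection attempt still succeeds because the kernel completes the handshake.

```python
import socket


def try_connect(host, port, timeout_s=3.0):
    """Return 'connected', 'timed out', or a description of the failure."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return "connected"
    except socket.timeout:
        return "timed out"
    except OSError as exc:
        return f"failed: {exc}"


# A server that listens but deliberately never calls accept().
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)
port = server.getsockname()[1]

print(try_connect("127.0.0.1", port))
server.close()
```

Passing an explicit timeout_s also avoids depending on the operating system default when the peer really is unreachable.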

Node.js Server Timeout Problems (EC2 + Express + PM2)

I'm relatively new to running production node.js apps and I've recently been having problems with my server timing out.
Basically, after a certain amount of usage and time my node.js app stops responding to requests. I don't even see routes being fired on my console anymore; it's like the whole thing just comes to a halt and the HTTP calls from my client (iPhone running AFNetworking) don't reach the server anymore. But if I restart my node.js app server everything starts working again, until things inevitably stop again. The app never crashes, it just stops responding to requests.
I'm not getting any errors, and I've made sure to handle and log all DB connection errors so I'm not sure where to start. I thought it might have something to do with memory leaks so I installed node-memwatch and set up a listener for memory leaks but that doesn't get called before my server stops responding to requests.
Any clue as to what might be happening and how I can solve this problem?
Here's my stack:
Node.js on AWS EC2 Micro Instance (using Express 4.0 + PM2)
Database on AWS RDS volume running MySQL (using node-mysql)
Sessions stored w/ Redis on same EC2 instance as the node.js app
Clients are iPhones accessing the server via AFNetworking
Once again no errors are firing with any of the modules mentioned above.
First of all you need to be a bit more specific about timeouts.
TCP timeouts: TCP divides a message into packets which are sent one by one. The receiver needs to acknowledge having received each packet. If the receiver does not acknowledge a packet within a certain period of time, a TCP retransmission occurs, which means sending the same packet again. If this happens a few more times, the sender gives up and kills the connection.
HTTP timeout: An HTTP client like a browser, or your server while acting as a client (e.g: sending requests to other HTTP servers), can set an arbitrary timeout. If a response is not received within that period of time, it will disconnect and call it a timeout.
Now, there are many, many possible causes for this... from more trivial to less trivial:
Wrong Content-Length calculation: If you send a request with a Content-Length: 20 header, that means "I am going to send you 20 bytes". If you send 19, the other end will wait for the remaining 1. If that takes too long... timeout.
Not enough infrastructure: Maybe you should assign more machines to your application. If (total load / # of CPU cores) is over 1, or your memory usage is high, your system may be over capacity. However keep reading...
Silent exception: An error was thrown but not logged anywhere. The request never finished processing, leading to the next item.
Resource leaks: Every request needs to be handled to completion. If you don't do this, the connection will remain open. In addition, the IncomingMessage object (usually called req in Express code) will remain referenced by other objects (e.g. Express itself). Each one of those objects can use a lot of memory.
Node event loop starvation: I will get to that at the end.
For memory leaks, the symptoms would be:
the node process would be using an increasing amount of memory.
To make things worse, if available memory is low and your server is misconfigured to use swapping, Linux will start moving memory to disk (swapping), which is very I/O and CPU intensive. Servers should not have swapping enabled.
cat /proc/sys/vm/swappiness
returns the swappiness level configured on your system (from 0 to 100). You can modify it in a persistent way via /etc/sysctl.conf (requires restart) or in a volatile way using: sysctl vm.swappiness=10
Once you've established you have a memory leak, you need to get a core dump and download it for analysis. A way to do that can be found in this other Stackoverflow response: Tools to analyze core dump from Node.js
For connection leaks (you leaked a connection by not handling a request to completion), you would see an increasing number of established connections to your server. You can count your established connections with: netstat -a -p tcp | grep ESTABLISHED | wc -l
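
To watch that number over time rather than eyeballing it, the netstat output can be tallied per state. The sketch below assumes the column layout printed by Linux netstat -ant (state in the sixth column); count_tcp_states is a hypothetical helper and SAMPLE is made-up illustrative output, not data from the poster's server.

```python
from collections import Counter


def count_tcp_states(netstat_output):
    """Tally TCP connection states from `netstat -ant` style text."""
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Data rows start with tcp/tcp6 and carry the state in column 6.
        if len(fields) >= 6 and fields[0].startswith("tcp"):
            states[fields[5]] += 1
    return states


# Illustrative sample only; real output would come from running netstat.
SAMPLE = """\
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN
tcp        0      0 10.0.0.5:3000           10.0.0.9:51514          ESTABLISHED
tcp        0      0 10.0.0.5:3000           10.0.0.9:51520          ESTABLISHED
tcp6       0      0 :::3000                 :::*                    LISTEN
"""

print(count_tcp_states(SAMPLE)["ESTABLISHED"])  # 2
```

Logging this count periodically makes a leak visible as a steadily rising ESTABLISHED figure that never drops back after traffic subsides.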
Now, the event loop starvation is the worst problem. If you have short lived code node works very well. But if you do CPU intensive stuff and have a function that keeps the CPU busy for an excessive amount of time... like 50 ms (50 ms of solid, blocking, synchronous CPU time, not asynchronous code taking 50 ms), operations being handled by the event loop such as processing HTTP requests start falling behind and eventually timing out.
The way to find a CPU bottleneck is to use a performance profiler. nodegrind/qcachegrind are my preferred profiling tools, but others prefer flamegraphs and such. However, it can be hard to run a profiler in production. Instead, take a development server and slam it with requests, i.e. run a load test. There are many tools for this.
Finally, another way to debug the problem is:
env NODE_DEBUG=tls,net node <...arguments for your app>
node has optional debug statements that are enabled through the NODE_DEBUG environment variable. Setting NODE_DEBUG to tls,net will make node emit debugging information for the tls and net modules... so basically everything being sent or received. If there's a timeout you will see where it's coming from.
Source: Experience of maintaining large deployments of node services for years.

Select system call hangs indefinitely in a n/w application.

We have a networking application, it will be used inside various scripts to communicate with other systems.
Occasionally the scripts hang on a call to our networking application. We recently experienced a hang, and I tried to debug the hung process of this particular application.
This application consists of a client and a server (a daemon); the hang occurs on the client side.
Strace output showed me that it's hung on a select system call.
> strace -p 34567
select(4, [3], NULL, NULL, NULL
As you can see there's no timeout given on select call, it can block indefinitely if the file descriptor '3' is not ready for reading.
lsof output showed that fd '3' is in FIN_WAIT2 state.
> lsof -p 34567
client 34567 user 3u IPv4 55184032 TCP client-box:smar-se-port2->server:daemon (FIN_WAIT2)
Does the above information imply something? FIN_WAIT2 state? I checked on the server side(where corresponding daemon process should be running), but there are no daemon processes running on server side. My guess is the daemon ran successfully and sent the output to client, which should be available on fd '3' for reading, but the select() call on client never comes out, and still waits for something to happen!
I am not sure why it never comes out of the select() call; this only happens occasionally, and most of the time the application works fine.
Any clues?
Both Server and client are SuSE Linux.
FIN_WAIT2 means your app has sent a FIN packet to the peer, but has not received a FIN from the peer yet. In TCP, a graceful close requires a FIN from both parties. The fact that the server daemon is not running means the daemon exited (or was killed) without notifying its peer (you). So your select() is waiting for packets it will never receive, and has to wait for the OS to invalidate the socket using an internal timeout, which can take a long time. This kind of situation is why you should never use infinite timeouts. Use an appropriate timeout and act accordingly if the timeout elapses.
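
The advice about timeouts can be illustrated with a small Python sketch. The original client is presumably C, so this only shows the idea: wait_readable is a hypothetical helper, and socket.socketpair() stands in for the client/daemon connection.

```python
import select
import socket


def wait_readable(sock, timeout_s):
    """Return True if sock becomes readable within timeout_s seconds."""
    readable, _, _ = select.select([sock], [], [], timeout_s)
    return bool(readable)


client_end, daemon_end = socket.socketpair()

# Nothing has been sent yet: select() returns after the timeout, not never.
print(wait_readable(client_end, 0.2))

daemon_end.sendall(b"reply")
print(wait_readable(client_end, 0.2))

client_end.close()
daemon_end.close()
```

When the call returns False, the caller can decide what to do (retry, log, or close the socket) instead of staying blocked the way the select(4, [3], NULL, NULL, NULL) call in the strace output does.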

Resources