A Python 3.5 application occasionally deadlocks when run.
To give more details about the application, the main thread (thread A) is responsible for receiving data from an external device and sending it to an MQTT broker.
There are two other threads that are spawned: one checks for Internet connectivity (thread B, explained in detail below) and the other runs a class which implements a watchdog to check for possible deadlocks.
Thread A is long-running and sends data to the broker as it is received. When there is no Internet connectivity, it starts storing data in a local database.
Thread B checks for Internet connectivity every 3 minutes so that, when connectivity is restored, the application can start sending the locally stored data back to the MQTT broker. This covers the scenario where the application loses Internet connectivity and would otherwise lose data received from the device; while offline, the application stores incoming data locally in a SQLite3 database.
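For illustration, a minimal sketch of the offline buffering just described, assuming a hypothetical table named pending and a publish() callable standing in for the real MQTT send (neither is taken from the original code):

import sqlite3

DB_PATH = "buffer.db"  # hypothetical path; the real schema is not shown in the question

def store_offline(payload):
    # Called by thread A while disconnected: buffer the reading locally.
    with sqlite3.connect(DB_PATH) as conn:
        conn.execute("CREATE TABLE IF NOT EXISTS pending (payload TEXT)")
        conn.execute("INSERT INTO pending (payload) VALUES (?)", (payload,))

def flush_pending(publish):
    # Called once thread B reports that connectivity is back: drain the buffer.
    with sqlite3.connect(DB_PATH) as conn:
        conn.execute("CREATE TABLE IF NOT EXISTS pending (payload TEXT)")
        rows = conn.execute("SELECT payload FROM pending").fetchall()
        for (payload,) in rows:
            publish(payload)
        conn.execute("DELETE FROM pending")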
The application runs as a systemd service, and Internet connectivity is provided through a WiFi dongle attached to the system.
Recently, we encountered a case where there was no Internet connectivity (all pings were being routed to the dongle's IP address); when the application tried to connect to the MQTT broker, it deadlocked, and the stack trace showed that this happened in the getaddrinfo function in socket.py.
Thread B was created to check for a successful Internet connection before trying to connect the MQTT client (to avoid this deadlock).
This thread also keeps checking connectivity later on, in case the Internet goes down while the application is already up and running.
In this case, the application occasionally runs into a deadlock between the main thread (thread A) and thread B.
Code for thread B is shown below:
import socket
from time import sleep

from IPy import IP  # third-party IPy package used to classify the resolved address

# isGoing, delay and disconnected are shared with the rest of the application.
while isGoing:
    try:
        # An offline dongle / captive portal tends to resolve public names to a
        # private address, so a PRIVATE result is treated as "no Internet".
        host = socket.gethostbyname("www.google.com")
        ip = IP(host)
        if ip.iptype() == 'PRIVATE':
            disconnected = True
        else:
            disconnected = False
    except Exception as e:
        print(e)
        disconnected = True
    sleep(delay)
When thread B was monitored, it was seen to hang when using the subprocess module and os.system commands, as well as the gethostbyname function.
Note: the Paho MQTT on_connect and on_disconnect callbacks are already used to check for connectivity.
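One common way to keep such a connectivity check from blocking indefinitely inside getaddrinfo is to test against a literal IP address with an explicit timeout; a rough sketch (the 8.8.8.8:53 endpoint and the 5-second timeout are arbitrary choices, not taken from the original application):

import socket

def internet_available(host="8.8.8.8", port=53, timeout=5.0):
    # A literal IP address avoids a DNS lookup, and the timeout bounds the
    # TCP connect, so the check cannot hang the way gethostbyname can.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False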
Related
I am using Python 3.8 and trying to connect to an MQTT broker. This connection follows the path below:
Client (spawned with multiprocessing) -> thread (spawned by the client) -> thread tries to connect
I see the threads getting stuck in the socket create_connection function when the socket is created. Curiously enough, if I turn things around this way:
Client (spawned with multithreading) -> process (spawned by the client) -> process tries to connect
it works. Is there any reason why, in the first case, the threads spawned by the client process can't connect to the server? I can't really debug this, as all the exceptions are swallowed by the process.
Thanks
It turned out that the process driving the threads and the threads themselves were all daemons. For some strange reason, even if you start the processes and the threads and put a sleep after the process, the threads won't connect to the server/broker even though they run. The solution is to not declare the threads as daemons and to join them; then they are able to connect to the server. The error wasn't clear at all, because I would have expected the threads not to run up to that point, or at least some clear indication of what was happening.
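A rough sketch of the arrangement described above (non-daemon thread, explicitly joined); the broker address and the plain TCP connect are placeholders for the real MQTT client code:

import multiprocessing
import socket
import threading

def connect_worker(host, port):
    # Stand-in for the real MQTT connect call.
    with socket.create_connection((host, port), timeout=10):
        print("connected")

def client(host, port):
    # Non-daemon thread, explicitly joined: the process waits for the
    # connection attempt instead of tearing the thread down on exit.
    t = threading.Thread(target=connect_worker, args=(host, port), daemon=False)
    t.start()
    t.join()

if __name__ == "__main__":
    p = multiprocessing.Process(target=client, args=("broker.example.com", 1883))
    p.start()
    p.join()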
I have my network running with 3 machines, each one with:
1 orderer
1 peer
1 ca
1 Node.js client
They are deployed on AWS and a load balancer correctly distributes the requests to the 3 different clients.
The clients are not using the discovery service; it is disabled.
Client1 only contacts orderer1 and peer1 and ca1, and so on and so forth for the others.
I want to test Hyperledger's high availability, so while inserting data I shut down a machine, say machine1, and the others should continue execution.
What happens is that while the machine is down, the network stops executing. The clients do not move at all (they do not crash, they just stop).
When I bring the machine up again, I see errors coming in, but execution then continues.
It seems that calls to machine1 are left suspended, but they recover as soon as the machine is back up.
What I want is that if machine1 goes down, the requests to it are rejected and machines 2 and 3 continue execution.
How can I achieve this?
[EDIT] Additional information: I have inserted some logs in the client, especially in my endpoint for creating transactions, like this:
console.log('Starting Creation')
await contract.submitTransaction(example)
console.log('Creation done')
res.send(200)
Let me also say that these lines are wrapped in an error handler, so that if any error occurs, I catch it.
But I get no error; I just see the first print, and submitTransaction keeps running for a long time, never receiving an answer.
It seems like it tries to deliver the request to the orderer, but the orderer is not online.
When I bring down an orderer with docker service scale orderer1=0 (since I am using services with Docker Swarm), the orderer leader notes in its logs that it went offline. Also, if I bring the orderer back up, a new election starts.
This seems correct; in fact, the problem only happens when I shut down the machine, closing the connection in a non-friendly way.
I'm writing a socket.io based server in Node.js (6.9.0). I am using the built-in cluster module to enable multiple processes. For now, there are only two processes: a master and a worker. The master receives the connections and maintains an in-memory global data structure (which the worker can query via IPC). The worker process does the majority of the work by handling each incoming connection.
I am finding a hanging condition that I cannot attribute to any internal failure when the server is stressed at 300 concurrent users. Under lower concurrency, I don't see the hanging condition.
I'm enabling all forms of debugging (using the debug module: socket.io:socket, socket.io:client as well as my own custom calls to debug).
The last activity I can see is in socket.io; however, the messages indicate that sockets are closing ("reason client namespace disconnect") due to their own "end of test" cycle. It just seems like incoming connections are not being serviced.
I'm using Artillery.io as the test client.
In the server application, I have handlers for uncaught exceptions and try-catch blocks around everything.
In a prior iteration, I also used cluster, but reversed the responsibilities so that the master process handled the connections (with the worker handling global data). That didn't exhibit the same failure. Not sure if something is wrong with the connection distribution. For that, I have also dumped internalMessage events to monitor the internal workings of cluster.
I am not using any other module for connection distribution or sticky sessions. As there is only a single process handling connections (at this time), it doesn't seem relevant.
I was able to remove the hanging condition by changing the cluster scheduling policy from Round Robin (SCHED_RR) to None, which is OS specific (SCHED_NONE). I can't tell whether this is due to a bug in connection distribution (or something else inherent in the scheduling policy), but this one change seems to prevent the hanging condition.
We are running an IBM MDM server (Initiate) which connects through a pooling mechanism to an Oracle DB server. The pool size has been set to 32. We also have a custom Java process that submits data to this MDM server through an API that the MDM server exposes. Once our custom Java process (which does not open any DB connections directly) terminates, we see that the number of processes between the MDM server and the DB server has risen to some number greater than 32. After each nightly run, the number of processes keeps increasing until it finally reaches the limit set by the Oracle DB (700); the DB then won't allow any more connections to be opened, and our process fails that night. We are trying to figure out why the processes aren't getting terminated and why they are still in ESTABLISHED state (as per the netstat command).
There are several reasons the number of processes could increase and sockets could remain in the ESTABLISHED state.
A typical mistake is spawning a child process for each message/connect/register and not reusing the existing connection, especially when timer callbacks are involved.
e.g.,
c - register for timer callback -> server
c -> spawn a process to receive the reply and listen on receive socket
c - register for timer callback -> server
c -> spawn a process to receive the reply and listen on receive socket
Instead, it should be:
c - register for timer callback -> server
c -> spawn a process to receive the reply and listen on receive socket
c - set the initialized flag
c - register for timer callback -> server
c -> if initialized do not spawn a process to receive the reply
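Expressed as code, the corrected pattern boils down to doing the expensive setup (spawning the receiver or opening the connection) once, recording it with an initialized flag, and reusing it on every later callback; a rough Python sketch with illustrative names (nothing here comes from the MDM product itself):

import socket

class ReplyReceiver:
    def __init__(self, host, port):
        self._host = host
        self._port = port
        self._conn = None  # acts as the "initialized" flag

    def on_timer_callback(self):
        # First callback: open the connection that will receive replies.
        if self._conn is None:
            self._conn = socket.create_connection((self._host, self._port))
        # Later callbacks reuse the same connection instead of spawning a
        # new process or opening another socket/DB connection each time.
        return self._conn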
Does the system suffer any exception after reaching the maximum limit?
Are the processes that were created still active?
Are these processes establishing DB connections but not getting terminated?
Does the top output show the active processes?
1) Clear the old logs.
2) lsof data. This is an operating system command that will tell us what descriptors are being used by the app server process.
lsof -p PID > lsof.out
3) ulimits. These are the operating system resource limits.
ulimit -a > ulimits.out
Please check whether the connections opened by the code are being closed after use.
Check the lsof output and the state of each connection.
I work for IBM as a Java Service Engineer. Please answer the questions above so that we can help you better.
I have multiple daemons (one gateway and several service daemons, all running on the same node), some of which need to respond in "soft real time" to requests arriving on the network. My architecture is as follows: a gateway daemon routes incoming packets, based on a protocol tag, to the corresponding service daemons; the service daemons process the requests and send the responses back to the gateway daemon, which puts them on the wire. All of this is working, but I am not achieving the "soft real time" behaviour and am seeing a lag.
I plan to improve on this in the following way, by sharing the network connection between the gateway and the service daemons. I will have a notification scheme by which, when packets arrive on the connection, the gateway daemon, without dequeuing the packet from the socket queue, looks at the protocol header and "notifies" the corresponding service daemon that data has arrived. On receiving the notification, the service daemon grabs a binary semaphore and dequeues the data from the socket queue. There will be two such semaphores, one for writing and the other for reading. When the service daemon needs to send data, it grabs the write semaphore and sends the data; when it receives the "data arrived" notification from the gateway daemon, it grabs the read semaphore and dequeues the data from the socket. On every new connection request, the gateway daemon will send the connection to the service daemons using sendmsg.
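For reference, the descriptor-passing step ("send the connection to the service daemons using sendmsg") is normally done with SCM_RIGHTS ancillary data over an AF_UNIX socket; here is a minimal Python sketch of just that mechanism (the daemons themselves are presumably written in another language, so this only illustrates the call shape):

import array
import socket

def send_fd(unix_sock, fd):
    # Hand an open descriptor to another process over an AF_UNIX socket;
    # the one-byte payload only serves to wake up the receiver.
    unix_sock.sendmsg([b"F"],
                      [(socket.SOL_SOCKET, socket.SCM_RIGHTS,
                        array.array("i", [fd]))])

def recv_fd(unix_sock):
    # Receive one descriptor sent with SCM_RIGHTS.
    fds = array.array("i")
    msg, ancdata, flags, addr = unix_sock.recvmsg(1, socket.CMSG_SPACE(fds.itemsize))
    for level, ctype, data in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_RIGHTS:
            fds.frombytes(data[:fds.itemsize])
            return fds[0]
    raise RuntimeError("no descriptor received")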
Has anybody tried this scheme? Do you see any problems with this approach? Please comment/advise.
If you want to avoid copy overhead you should probably be using splice, rather than trying to share sockets between multiple daemons. That solution is going to be fiendishly difficult to debug and maintain.
I expect (and hope) that your network protocol has a header which makes it easy for the gateway to know where to route a packet to, followed by a payload destined for the service daemon.
In pseudocode the gateway does this:
while (data on socket)
{
    /* Read just the fixed-size routing header from the gateway socket. */
    header = read(socket, sizeof(header));
    /* Map the protocol tag in the header to the right service daemon. */
    service_socket = find_service(header);
    /* Move the payload to the service daemon without copying it through
       user space (in practice splice() needs a pipe at one end, so the
       bytes go socket -> pipe -> service_socket via two splice() calls). */
    splice(socket, NULL, service_socket, NULL, header->payload_length, 0);
}