Pika hangs when closing the consumers - python-3.x

I'm using pika on python3 to consume a queue on RabbitMQ. My consumer is supposed to stop/be killed on certain events, and thus it should be able to handle the closing of channels and connections by itself. After some running, I found out that I had a lot of "zombie consumers" (for want of a better term, they are registered consumers on the Rabbit, with their unacked messages lying around, but often do not have a matching process anymore) left hanging around.
After some experiments I found out that, when I try to run a channel.cancel() the process just hangs until something else comes to kill it and then the consumer is considered still active by Rabbit for some time (I think about 20 minutes).
My code works like this:
def do_some_work(method_frame, body):
    # something happens here
    if condition:
        logging.info("Closing up...")
        requeue = channel.cancel()
        logging.info("Consumer stopped, {0} messages sent back".format(requeue))

for method_frame, properties, body in channel.consume(args.queue):
    do_some_work(method_frame, body)
When condition is met, I can see the first log line and the requests to get more messages from Rabbit stop (at least this is what I can tell from Rabbit), but the process effectively hangs without closing the channels and connection, until clean ups from Rabbit and the OS happen.

Related

How can a forked node process send data to a terminal or to the parent on exit?

I am dealing with an odd problem which I couldn't find the answer to online, nor through a lot of trial and error.
In a multi-process cluster, forked worker processes can run arbitrarily long commands, but the parent process listens for keepalive messages sent by workers, and kills workers that are stuck for longer than X seconds.
Worker processes can asynchronously communicate with the rest of the world (using http, or process.send ipc communication), but on exit, I'd like to be able to communicate some things (typically, queued logs or error details).
Most online documentation for process.on('exit', handler) indicates usage of console.log, however it seems like forked processes don't inherit a normal stdout, and the console.log isn't a direct tty, it's a stream (the ipc stream, I presume?).
Because of this, the process exit handler doesn't let me use console.log to log extra lines (or if it does, I'm not sure where these lines end up).
I tried various combinations of fork options (silent/not silent, non-default stdio options like inherit), using fs.write to write to the tty or a real file, and using process.send, but in no case was I able to get the on-exit handler to log anywhere visible.
How can I get the forked process to successfully log on exit?
Small additional points: all this testing is on unix-like systems (macOS, Amazon Linux...) and both parent and child processes are started with --trace-sigint so that we can get at least the top 10 stack frames of the interrupted process on exit. These frames do make it out to the terminal successfully.
This was a bit of a misunderstanding about how SIGINT is handled, and I believe that it's impossible to accomplish what I want here, but I'd love to hear if someone else found a solution.
Node has its own SIGINT handler which is "more powerful" than custom SIGINT handlers - typically it interrupts infinite loops, which is extremely useful in the case where code is blocked by long-running operations.
Node allows one-upping its own SIGINT debugging capabilities by attaching a --trace-sigint flag which captures the last frames of execution.
If I understood this correctly, there are 4 cases with different behavior:
1. No custom handler, event loop blocked: the process is terminated without any further code execution (and --trace-sigint can give a few stack traces).
2. No custom handler, event loop not blocked: normal exit flow, the process.on('exit') event fires.
3. Custom handler, event loop blocked: nothing happens until the event loop unblocks (if it does), then normal exit flow.
4. Custom handler, event loop not blocked: normal exit flow.
This happens regardless of the way the process is started, and it's not a problem about pipes or exit events - in the case where the event loop is blocked and the native signal handler is in place, the process terminates without any further execution.
It would seem like there is no way to both get a forced process exit during a blocked event loop, AND still get node code to run on the same process after the native interruption to recover more information.
Given this, I believe the best way to recover information from the stuck process is to stream data out of it before it freezes (sounds obvious, but brings a lot of extra considerations in production environments).
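One way to read "stream data out before it freezes" is to leave synchronous breadcrumbs before each potentially blocking step, so the parent can inspect them after killing the worker. The following is only a minimal sketch under that assumption: createBreadcrumbs, leave, read and the log file name are hypothetical names, not part of Node or of the setup described above. It uses fs.appendFileSync so that each record has been handed to the OS before the next step starts.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: appends one JSON line per breadcrumb, synchronously,
// so the record has left the process before any blocking work begins.
function createBreadcrumbs(filePath) {
  return {
    // Record what the worker is about to do.
    leave(stage, details = {}) {
      const record = {
        pid: process.pid,
        time: new Date().toISOString(),
        stage,
        details,
      };
      fs.appendFileSync(filePath, JSON.stringify(record) + '\n');
    },
    // Read all breadcrumbs back (for example, from the parent after a kill).
    read() {
      return fs
        .readFileSync(filePath, 'utf8')
        .split('\n')
        .filter(Boolean)
        .map((line) => JSON.parse(line));
    },
  };
}

// Usage: the file location is an assumption made for this example.
const crumbs = createBreadcrumbs(
  path.join(os.tmpdir(), `worker-${process.pid}.log`)
);
crumbs.leave('start', { job: 'demo' });
```

The synchronous write costs some throughput on every step, which is one of the production considerations mentioned above; batching several breadcrumbs per write would trade detail for speed.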

socket.io | losing messages due to frequency and volume

I have around 5700 messages (each message is a 100x100 image as a Base64 string) which I emit from the server to the client from within a for-loop, pretty fast:
[a pretty big array].forEach((imgAsBase64) => {
  io.emit('newImgFromServer', imgAsBase64)
})
The client only receives from 1700 to 3000 of them in total, before I get a:
disconnected due to = transport error
socket connected
Once the socket re-connects (and the for-loop has not ended) the emission of new messages from within the loop resumes but I have lost those previous ones forever.
How can I make sure that the client receives all of the messages every time?
This question is an interesting example of "starving the event loop". If you're in a tight for loop for some period of time with no await in the loop, then you don't let the event loop process any other events during the duration of the for loop. If some events need to be processed during that time for things to work properly, you get problems. Read on for how that applies to this case.
Both client and server need some occasional cycles to process housekeeping pings and pongs in the socket.io protocol. If you firehose messages from one end to the other in a non-stop for loop, you can starve the ability to process those housekeeping messages and it will think that it has timed out (not received the housekeeping messages when it should have which is usually a sign of a lost or inoperative connection). In reality, the housekeeping messages are sitting in the event loop waiting to be processed, but if you never give the event loop a chance to process them, some other code running in the for loop will think that they never arrived.
So, you have to make sure you give both ends enough occasional cycles to process those housekeeping messages. The typical way to do that is to just make sure that you aren't firehosing messages. Send N messages, then pause for a short period of time (enough time for the event loop to be able to service any incoming network events). Then send N more, pause, etc...
In addition, you could make this whole process a lot more efficient by combining a number of the Base64 strings into a single message. You can probably just put them into an array of 100 of them and send that array of 100 and repeat until they are all sent. Then, obviously change the client to expect an array of Base64 strings instead of just a single one. This will obviously result in a lot fewer messages to send (which is more efficient), but you will still need to pause every so often to let the server process things in the event loop.
Exactly how many messages to send before pausing is something that could be figured out via trial and error, but if you put 100 images into a single message and send 10 of these larger messages (which sends 1,000 images) and then pause for even just 50ms, that should be enough time for the event loop to service any inbound ack messages from socket.io to avoid the timeout. Any sort of pause using setTimeout() makes the setTimeout() get in line behind most other messages that are waiting in the event loop so even a short pause with setTimeout() tends to accomplish the goal of letting the event loop process the things that were waiting to be run.
If end-to-end time was super important, you could experiment with sending more messages at once and/or changing the pause time, but you don't want to end with a setting that is close to where you get a timeout (you want some safety factor).
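The batching-and-pausing idea above could be sketched roughly as follows. This is an illustration, not tested against socket.io itself: sendInBatches, chunk and pause are hypothetical helpers, the emit argument stands in for something like (batch) => io.emit('newImgsFromServer', batch) (a made-up event name), and the defaults of 100 images per message, 10 messages per burst and a 50ms pause are simply the numbers suggested above.

```javascript
// Split an array into consecutive chunks of at most `size` elements.
function chunk(items, size) {
  const out = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// Resolve after `ms` milliseconds; awaiting this yields to the event loop.
const pause = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: sends `images` as arrays of `perMessage` items and
// pauses after every `messagesPerBurst` messages.
async function sendInBatches(images, emit, opts = {}) {
  const { perMessage = 100, messagesPerBurst = 10, pauseMs = 50 } = opts;
  const messages = chunk(images, perMessage);
  for (let i = 0; i < messages.length; i++) {
    emit(messages[i]);
    // After each full burst, let pending ping/pong and ack events run.
    const burstDone = (i + 1) % messagesPerBurst === 0;
    if (burstDone && i + 1 < messages.length) {
      await pause(pauseMs);
    }
  }
  return messages.length; // number of messages actually emitted
}
```

With the 5700 images from the question and the default settings, this would emit 57 messages with 5 pauses in between. The client handler would have to accept an array of Base64 strings per event instead of a single string.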

ZMQ socket queue

I'm pretty new with ZMQ and I'm working with the NodeJS binding. I have an application that uses PUSH/PULL sockets. On one side I PUSH data to some nodes that through the PULL socket receive and process it. Sometimes I have to kill one or more nodes of my application, and it can happen that these nodes still have some data in the PULL socket to be processed. I don't want to lose this data, so I was wondering if there is a way to access ZMQ's PULL socket queue to check if there are still messages to be processed.
I actually couldn't find anything in the specs of ZMQ and the NodeJS binding, so maybe I'm getting the whole concept wrong.
If you kill a process then any data in that process's buffers will be lost.
Instead of killing the process forcefully, you should always find a way to allow processes to shut down gracefully. Here, you can send a "KILL" message to the PULL socket; the process can then read that and exit when it receives it. If you can flush the socket buffer (depends if there are other processes still sending to it), you can do that and then exit when there are no more messages to read.
I'm posting the solution I found. It's not really a solution as I'm not using the ZMQ socket to check that there are no more messages in the queue, it's just a workaround/hack that came to my mind to make the thing work. I don't have time to write the queue handling by myself, so here's how I solved the problem:
Whenever the processes receive messages to process, they store a timestamp through new Date().getTime(). Whenever a process needs to be killed, a kill message is sent to it. When the process receives that message, it starts a timer with setInterval. Every x seconds (I put 10, can be more or less) the timer fires a function that checks whether the last received message is old enough (it takes the current timestamp, subtracts the last saved one, and if the result is greater than y, which in my case is 100 seconds, it is old enough). If it is, it means no more messages have been received (no more messages in the queue), so it kills the process; otherwise it does nothing.
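The workaround described above might look roughly like this. It is a sketch only: createIdleShutdown, touch and requestShutdown are hypothetical names, the ZMQ socket wiring is left out entirely, and the defaults mirror the 10 second check interval and 100 second idle threshold mentioned above. The now parameter exists only so the clock can be replaced in tests.

```javascript
// Hypothetical helper: remembers when the last message arrived and, once a
// shutdown was requested, calls `onIdle` after `idleMs` without new messages.
function createIdleShutdown(options) {
  const {
    checkEveryMs = 10000, // "x": how often to check
    idleMs = 100000,      // "y": how old the last message must be
    onIdle,               // e.g. () => process.exit(0)
    now = Date.now,
  } = options;
  let lastMessageAt = now();
  let timer = null;

  return {
    // Call this from the PULL socket's message handler.
    touch() {
      lastMessageAt = now();
    },
    // Call this when the kill message arrives.
    requestShutdown() {
      if (timer !== null) return; // already waiting for the queue to go quiet
      timer = setInterval(() => {
        if (now() - lastMessageAt > idleMs) {
          clearInterval(timer);
          onIdle();
        }
      }, checkEveryMs);
    },
  };
}
```

As noted above, this only infers that the queue is empty from the absence of new messages, so it relies on the senders having stopped pushing to this node.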

Signalling a producer task from a consumer task when working with a BlockingCollection

I have a pretty basic application that uses a Producer task and a Consumer task to work with files. It is based off the example here http://msdn.microsoft.com/en-us/library/dd267312.aspx
The basics of the program are that the Producer task enumerates the files on my hard drive and calculates their hash values and does a few other things. Once the Producer has finished working with a file, it enqueues the file and the Consumer then grabs it.
The Consumer task has to connect to a remote server and attempt to upload the file. However, if the Consumer encounters an error, such as not being able to connect to the remote server, I need it to signal the Producer task that it should stop what it is doing and terminate. If the server is down, or goes down, there is no need for the Producer to continue cycling through thousands of files.
I have seen plenty of samples of signalling the Consumer task from the Producer task by using .CompleteAdding() on the BlockingCollection object but I am lost as to how to send a signal to the Producer from the Consumer that it should stop producing.
You could use a return queue. If one of the items generates an error/exception, you could load it up with error data and queue it back to the producer. The producer should TryTake() from the return queue just before generating a new item and handle any returned item appropriately. This beats using some atomic boolean by enabling the item to signal back with extended error information that could be used to decide what action to take - the producer may not always want/need to stop. Also, you could then queue up errored items to a GUI list and/or logger.
It's tempting to say that the consumer should return items anyway, whether they are errored or not, so that they can be re-used instead of creating new ones all the time. This, however, introduces latency in detecting/acting on errors unless you use two return queues to prioritize error returns.
Oh - another thing - using the above design, if it has to stop, the producer could retain errored items in a local queue and re-issue one occasionally. If the server comes back up (as indicated by the return of a successful item), the producer could re-issue the errored jobs from the local queue again before generating any more new ones. With care, this could make your upload system resilient to server reboots.
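The basic return-queue idea can be sketched in a language-neutral way. The example below is JavaScript rather than C#, so it is an analogy and not BlockingCollection code: plain arrays play the role of the two queues, returnQueue.shift() stands in for the non-blocking TryTake(), and producer, consumer, run and upload are hypothetical names. It covers only the "stop on first returned error" case, not the retry logic.

```javascript
// Yield to the event loop so the other task gets a turn.
const tick = () => new Promise((resolve) => setImmediate(resolve));

async function producer(files, workQueue, returnQueue) {
  const produced = [];
  for (const file of files) {
    // Non-blocking check of the return queue before producing a new item.
    const returned = returnQueue.shift();
    if (returned !== undefined) {
      // The returned item carries the extended error information.
      return { produced, stoppedBy: returned };
    }
    workQueue.push({ file, error: null });
    produced.push(file);
    await tick();
  }
  return { produced, stoppedBy: null };
}

async function consumer(workQueue, returnQueue, upload, producerDone) {
  while (!producerDone() || workQueue.length > 0) {
    const item = workQueue.shift();
    if (item === undefined) {
      await tick(); // nothing to do yet
      continue;
    }
    try {
      await upload(item.file);
    } catch (err) {
      // Load the item with error data and queue it back to the producer.
      item.error = err.message;
      returnQueue.push(item);
    }
  }
}

async function run(files, upload) {
  const workQueue = [];
  const returnQueue = [];
  let done = false;
  const producing = producer(files, workQueue, returnQueue).then((result) => {
    done = true;
    return result;
  });
  const consuming = consumer(workQueue, returnQueue, upload, () => done);
  const [result] = await Promise.all([producing, consuming]);
  return result;
}
```

Because the producer inspects the returned item rather than a boolean flag, it could just as well decide to skip, retry, or log instead of stopping.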

Single-Threaded Windows Service Delaying OnStop

I have a Windows Service (C# 4.0) that picks messages off of a private message queue and for each message sends one or more emails (typically 4 or 5 at most) based on message content.
Message volume is low so I have avoided complexity and left the service single-threaded, but the emails are important so I need to ensure that on an SCM Stop Command any in-process messages/emails are processed/sent before the Stop completes.
In OnStop I am checking a static "inProcess" flag representing status and if it is set I am calling ServiceBase.RequestAdditionalTime(120000).
There are 2 problems:
1. The Stop Command completes immediately with some e-mail unsent, despite the request for 2 minutes.
2. Even if it worked I am only guessing at how long I should wait.
What is the best way to handle this in a single-threaded service?
Thanks for your help!
Greg
To fully answer, we'd need to see the structure of your message processing loop. But one thing I'm thinking is that the ServiceBase.RequestAdditionalTime() method is used to keep the SCM from complaining if a stop command (or pause, continue, start) takes too long; it doesn't mean your service will wait two minutes before stopping.
Thus, the only thing it truly does is keep the SCM from erroring out on a stop request, if you have a slow stop process.
See MSDN here: RequestAdditionalTime() method
What I'm wondering is if you get called in OnStop() and you set some complete flag, and the processing loop immediately exits when it sees this flag?
If you could post your code it would help me refine this answer, but from the question I wonder if you are expecting the call to wait for 2 minutes to let it process more, but you are setting something to tell the processing loop to stop. If this is not the case I can refine the answer further.
As for how long you should wait, that depends on how critical the emails are and how many are likely to be in the queue, and if they are persisted anywhere so that restarting the service would pick up where they left off.
