I've recently been investigating and playing around with Linux message queues and have come across some behaviour that I don't quite understand!
If two programs are running, both using msgrcv() in an infinite loop to check for messages, and I then send two messages, the first program receives the 1st message and the second program receives the 2nd. If I keep sending messages, delivery alternates between the two receivers.
Obviously I understand that as soon as one program has read a message it is removed from the queue, but how is it decided which program will receive a message when both are checking indefinitely?
Any help would be appreciated!
The short answer is that the kernel decides.
The long answer is that this is handled by the do_msgrcv() call within the Linux kernel. If there is no message available, the caller gets put on a queue until a message is available. It's not guaranteed to go back and forth like you describe, since it all depends on the timing of each msgrcv() call, but in your case, it will probably behave that way virtually all of the time.
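If you want to see the same kind of behaviour in miniature, here's a loose analogy in Go - not the SysV API itself, just an illustration that when several receivers block on one queue, the system (runtime or kernel) picks which blocked receiver gets each message, and you shouldn't rely on the order:

package main

import (
    "fmt"
    "sync"
)

func main() {
    msgs := make(chan int)
    var wg sync.WaitGroup

    // Two receivers blocked on the same channel, roughly like two
    // processes blocked in msgrcv() on the same message queue.
    for id := 1; id <= 2; id++ {
        wg.Add(1)
        go func(id int) {
            defer wg.Done()
            for m := range msgs {
                fmt.Printf("receiver %d got message %d\n", id, m)
            }
        }(id)
    }

    // Which receiver gets each message is up to the scheduler; it often
    // looks like it alternates, but that is not guaranteed.
    for i := 1; i <= 6; i++ {
        msgs <- i
    }
    close(msgs)
    wg.Wait()
}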
I have an application in which the worker thread needs to call a method on the main thread.
I am planning to send a custom message via win32api.PostThreadMessage (with a message ID of WM_USER+X), and when this message is received, some function must be executed on the main thread. What I am looking for is a way to register a method for the corresponding WM_USER+X message.
Look at the RegisterWindowMessage function; it does pretty much exactly what you are after (it provides a message number that should not collide with any other). The one downside is that the message number is then not a constant but will vary from run to run of your program. This makes the message loop somewhat more complicated, but it is well worth it for this sort of thing.
I'm working on what's basically a highly-available distributed message-passing system. The system receives messages from someplace over HTTP or TCP, performs various transformations on them, and then sends them to one or more destinations (also using TCP/HTTP).
The system has a requirement that all messages sent to a given destination are in-order, because some messages build on the content of previous ones. This limits us to processing the messages sequentially, which takes about 750ms per message. So if someone sends us, for example, one message every 250ms, we're forced to queue the messages behind each other. This eventually introduces intolerable delay in message processing under high load, as each message may have to wait for hundreds of other messages to be processed before it gets its turn.
In order to solve this problem, I want to be able to parallelize our message processing without breaking the requirement that we send them in-order.
We can easily scale our processing horizontally. The missing piece is a way to ensure that, even if messages are processed out-of-order, they are "resequenced" and sent to the destinations in the order in which they were received. I'm trying to find the best way to achieve that.
Apache Camel has a thing called a Resequencer that does this, and it includes a nice diagram (which I don't have enough rep to embed directly). This is exactly what I want: something that takes out-of-order messages and puts them in-order.
But I don't want the solution to be written in Java, and I need it to be highly available (i.e. resistant to typical system failures like crashes or system restarts), which I don't think Apache Camel offers.
Our application is written in Node.js, with Redis and PostgreSQL for data persistence. We use the Kue library for our message queues. Although Kue offers priority queueing, its feature set is too limited for the use case described above, so I think we need an alternative technology to work in tandem with Kue to resequence our messages.
I was trying to research this topic online, and I can't find as much information as I expected. It seems like the type of distributed architecture pattern that would have articles and implementations galore, but I don't see that many. Searching for things like "message resequencing", "out of order processing", "parallelizing message processing", etc. turns up solutions that mostly just relax the "in-order" requirement based on partitions or topics or whatnot. Alternatively, they talk about parallelization on a single machine. I need a solution that:
Can handle processing multiple messages simultaneously, in any order.
Will always send messages in the order in which they arrived in the system, no matter what order they were processed in.
Is usable from Node.js.
Can operate in an HA environment (i.e. multiple instances of it running against the same message queue at once without inconsistencies).
Our current plan, which makes sense to me but which I cannot find described anywhere online, is to use Redis to maintain sets of in-progress and ready-to-send messages, sorted by their arrival time. Roughly, it works like this (a rough sketch follows the list):
When a message is received, that message is put on the in-progress set.
When message processing is finished, that message is put on the ready-to-send set.
Whenever there's the same message at the front of both the in-progress and ready-to-send sets, that message can be sent and it will be in order.
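To make that concrete, here is a minimal, single-threaded sketch of the logic - written in Go purely for illustration, with made-up names; the real version would be a small Node library keeping these two structures in Redis sorted sets (scored by arrival time) and updating them in atomic transactions:

package main

import "fmt"

// Resequencer is a toy, in-memory sketch: messages are registered in
// arrival order, may finish processing in any order, and are sent strictly
// in arrival order.
type Resequencer struct {
    pending []string          // message IDs in arrival order (the "in-progress" set)
    ready   map[string]string // processed payloads waiting to be sent, keyed by ID
    send    func(payload string)
}

func NewResequencer(send func(string)) *Resequencer {
    return &Resequencer{ready: make(map[string]string), send: send}
}

// Arrived records a message the moment it enters the system.
func (r *Resequencer) Arrived(id string) {
    r.pending = append(r.pending, id)
}

// Processed marks a message as ready to send, then flushes every message
// that is both at the front of the arrival order and finished processing.
func (r *Resequencer) Processed(id, payload string) {
    r.ready[id] = payload
    for len(r.pending) > 0 {
        head := r.pending[0]
        p, ok := r.ready[head]
        if !ok {
            return // the oldest in-progress message isn't ready yet
        }
        r.send(p)
        delete(r.ready, head)
        r.pending = r.pending[1:]
    }
}

func main() {
    rs := NewResequencer(func(p string) { fmt.Println("sent:", p) })
    rs.Arrived("m1")
    rs.Arrived("m2")
    rs.Processed("m2", "second") // finished first, but held back
    rs.Processed("m1", "first")  // now both flush, in arrival order
}

The key invariant is that nothing is sent until the oldest message that entered the system has also finished processing, so arrival order is preserved no matter what order the workers finish in.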
I would write a small Node library that implements this behavior with a priority-queue-esque API using atomic Redis transactions. But this is just something I came up with myself, so I am wondering: Are there other technologies (ideally using the Node/Redis stack we're already on) that are out there for solving the problem of resequencing out-of-order messages? Or is there some other term for this problem that I can use as a keyword for research? Thanks for your help!
This is a common problem, so there are surely many solutions available. This is also quite a simple problem, and a good learning opportunity in the field of distributed systems. I would suggest writing your own.
You're going to have a few problems building this, namely
1: Guaranteed order of messages
2: Exactly-once delivery
You've found number 1, and you're solving this by resequencing them in redis, which is an ok solution. The other one, however, is not solved.
It looks like your architecture is not geared towards fault tolerance, so currently, if a server crashes, you restart it and continue with your life. This works fine when processing all requests sequentially, because then you know exactly when you crashed, based on what the last successfully completed request was.
What you need is either a strategy for finding out what requests you actually completed, and which ones failed, or a well-written apology letter to send to your customers when something crashes.
If Redis is not sharded, it is strongly consistent. It will fail and possibly lose all data if that single node crashes, but you will not have any problems with out-of-order data, or data popping in and out of existence. A single Redis node can thus hold the guarantee that if a message is inserted into the to-process-set, and then into the done-set, no node will see the message in the done-set without it also being in the to-process-set.
How I would do it
Using Redis seems like too much fuss, assuming that the messages are not huge, that losing them is OK if a process crashes, and that running them more than once - or even running multiple copies of a single request at the same time - is not a problem.
I would recommend setting up a supervisor server that takes incoming requests, dispatches each to a randomly chosen slave, stores the responses, and puts them back in order again before sending them on. You said you expected the processing to take 750ms. If a slave hasn't responded within, say, 2 seconds, dispatch the request again to another randomly chosen node after a random 0-1 second delay. The first one to respond is the one we're going to use. Beware of duplicate responses.
If the retry request also fails, double the maximum wait time. After 5 failures or so, each waiting up to twice (or any multiple greater than one) as long as the previous one, we probably have a permanent error, so we should probably ask for human intervention. This algorithm is called exponential backoff, and it prevents a sudden spike in requests from taking down the entire cluster. Not using a random interval, and instead retrying after a fixed n seconds, would probably cause a DoS attack every n seconds until the cluster dies, if it ever gets a big enough load spike.
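Here is a rough sketch of that retry loop with exponential backoff and jitter. It is simplified to sequential retries (the scheme above would instead race the retry against the still-outstanding request and take whichever answers first), and dispatch is a hypothetical stand-in for sending the request to a randomly chosen worker:

package main

import (
    "context"
    "errors"
    "fmt"
    "math/rand"
    "time"
)

// dispatch is a placeholder for "send msg to a random worker and wait for
// its response"; the real version would make an HTTP/TCP call and honour ctx.
func dispatch(ctx context.Context, msg string) (string, error) {
    return "", errors.New("worker did not respond")
}

// processWithRetry retries with a randomized, exponentially growing delay,
// and gives up after maxAttempts so a human can be paged.
func processWithRetry(msg string, maxAttempts int) (string, error) {
    timeout := 2 * time.Second  // initial per-attempt timeout
    maxDelay := 1 * time.Second // retry after a random delay in [0, maxDelay)

    for attempt := 1; attempt <= maxAttempts; attempt++ {
        ctx, cancel := context.WithTimeout(context.Background(), timeout)
        resp, err := dispatch(ctx, msg)
        cancel()
        if err == nil {
            return resp, nil
        }

        // Randomized (jittered) delay keeps retries from arriving in lockstep.
        time.Sleep(time.Duration(rand.Int63n(int64(maxDelay))))
        maxDelay *= 2 // exponential backoff
        timeout *= 2
    }
    return "", fmt.Errorf("giving up on %q after %d attempts", msg, maxAttempts)
}

func main() {
    if _, err := processWithRetry("example message", 5); err != nil {
        fmt.Println(err)
    }
}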
There are many ways this could fail, so make sure this system is not the only place data is stored. However, this will probably work 99+% of the time, it's probably at least as good as your current system, and you can implement it in a few hundred lines of code. Just make sure your supervisor is using asynchronous requests so that you can handle retries and timeouts. Javascript is by nature single-threaded, so this is slightly trickier than normal, but I'm confident you can do it.
Okay, SO is warning me about a subjective title, so please let me explain. Right now I'm looking at Go: I've read the spec and watched a few I/O talks. It looks interesting, but I have some questions.
One of my favourite examples was this select statement that listened to a channel that came from "DoAfter()" or something similar; the channel would deliver a value at a given time from now.
Something like this (this probably won't work; pseudo-Go if anything!):
to := Time.DoAfter(1000 * Time.MS)
select {
case <-to:
    return nil // we timed out
case d := <-waitingfor:
    return d
}
Suppose the thing we're waiting for happens really fast, so this function returns and is no longer listening on to - what happens in DoAfter?
I know (and like) that you ought not to test the channel, for example:
if chanToSendTimeOutOn.isOpen() {
    chanToSendTimeOutOn <- true
}
I like how channels synchronise things; with the example above, it is possible that the function could return after the isOpen() test but before true is sent. I really am against the test - it defeats the point of channels, which hide locks and whatnot.
I've read the spec and seen the run-time panics and recovery, but in this example where do we recover? Is the thing waiting to send the timeout a goroutine or an "object" of sorts? I imagined an "object" that keeps a sorted list of the things it has to send to after given times, just appends DoAfter requests to the queue in the right order, and works through it. I'm not sure where that would get an opportunity to recover.
If it spawned goroutines, each with their own timer (managed by the run-time of course, so threads don't actually block for time), what then would get the chance to recover?
The other part of my question is about the lifetime of channels. I would imagine they're ref-counted - well, the readable ends are ref-counted - so that if nothing anywhere holds a readable reference, the channel is destroyed. I'd call this deterministic. For the "point-to-point" topologies you can form, it will be, if you stick to Go's "send stuff via channels, don't share access to it" approach.
So here, for example, when the thing that wants a timeout returns, the to channel is no longer read by anyone, so the goroutine behind it is pointless. Is there a way to make it return without doing its work?
Example:
A file-reading goroutine that has used defer to close the file when it is done - can it "sense" that the channel it is supposed to send on has been closed, and thus return without reading any more?
I'd also like to know why the select statement is "nondeterministic". I'd have quite liked it if the first case took priority when the first and second are both ready (for a non-blocking operation) - I won't condemn it for that, but is there a reason? What's the implementation of this?
Lastly, how are go-routines scheduled? Does the compiler add some sort of "yielding" every so many instructions, so a thread running will switch between different goroutines? Where can I find info on the lower level stuff?
I know Go touts that "you simply don't need to worry about this" but I like to know what things I write actually hide (that could be a C++ thing) and the reasons why.
If you write to a closed channel, your program will panic (see http://play.golang.org/p/KU7MLrFQSx for example). You could potentially catch this error with recover, but being in a situation where you don't know whether the channel you are writing to is open is usually a sign of a bug in the program. The send side of the channel is responsible for closing it, so it should know the current state. If you have multiple goroutines sending on the channel, then they should coordinate in closing the channel (e.g. by using a sync.WaitGroup).
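For example, here is a small demonstration of recovering from a send on a closed channel - it works, but needing it usually means the ownership of the channel should be restructured instead:

package main

import "fmt"

// trySend attempts a send and converts the "send on closed channel" panic
// into a boolean result via recover.
func trySend(ch chan int, v int) (ok bool) {
    defer func() {
        if recover() != nil {
            ok = false
        }
    }()
    ch <- v
    return true
}

func main() {
    ch := make(chan int, 1)
    fmt.Println(trySend(ch, 1)) // true: the send succeeded
    close(ch)
    fmt.Println(trySend(ch, 2)) // false: the send panicked and we recovered
}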
In your Time.DoAfter hypothetical, it would depend on whether the channel was buffered. If it was an unbuffered channel, then the goroutine writing to the timer channel would block until someone read from the channel. If that never happened, then the goroutine would remain blocked until the program completed. If the channel was buffered, the send would complete immediately. The channel could be garbage collected before anyone read from it.
The standard library time.After behaves this way, returning a channel with a one-slot buffer.
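So a working version of the timeout pattern from the question, using the real time.After, would look something like this:

package main

import (
    "fmt"
    "time"
)

// waitWithTimeout returns the first value from waitingfor, or "" if nothing
// arrives within the timeout. time.After's channel has a one-slot buffer, so
// the timer can fire and deposit its value even if nobody ever reads it.
func waitWithTimeout(waitingfor <-chan string, timeout time.Duration) string {
    select {
    case <-time.After(timeout):
        return "" // we timed out
    case d := <-waitingfor:
        return d
    }
}

func main() {
    fast := make(chan string, 1)
    fast <- "result"
    fmt.Println(waitWithTimeout(fast, time.Second)) // "result"; the timer's value is simply dropped later

    fmt.Println(waitWithTimeout(make(chan string), 100*time.Millisecond) == "") // true: timed out
}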
I've got a scheduler and some workers in Azure. The scheduler puts messages into a queue and the workers pull those messages and work on them. I've now come across a scenario where I will need to move some data from table storage to our database once a certain threshold has been reached. These items need to be processed in order, oldest first. Once that threshold is met, all the other items are processed in order; the current message that triggered the transfer needs to be put at the end of the line and reprocessed.
So, to the meat of my question...
Is it fine to simply resend the message to the queue as is or is there a potential for that to cause problems?
queueProvider.SendMessage(message);
A co-worker mentioned that he "thought he might have read something about needing to do something special." I haven't seen anything to confirm his suspicions yet, however, so I thought I would pose the question here just to be safe.
The short answer is that it is fine. If you have a CloudQueueMessage, you can just send it to any queue (it is just a REST request at the end of the day). Every time you call AddMessage(), it creates a new ID (it might have the same pop receipt, but that doesn't matter). That being said, there are some things you might want to take care of and/or investigate:
If you push a message onto one queue, pop it, and push it to another queue or the same queue, you should probably delete the first message off the queue. Merely popping it means that you have set the invisibility timeout, but it will reappear soon (and you then have identical message content on each queue). So, if I pop a message and immediately push it again, I now have 2 messages in the queue with identical content.
You can now update messages. This might be appropriate for you if you need ordering: you can indicate on the message itself, in metadata or content, what stage of processing it is in, and you get some ordering here with a thoughtful implementation.
It is recommended that all logic inside the consumer of the queue be idempotent since a message can actually be picked up more than once. We have to keep in mind that the queue service guarantees that a message will be delivered, AT LEAST ONCE - so you could end up duplicating messages with this approach.
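As a sketch of what "idempotent" can mean in practice - independent of the actual Azure SDK, with made-up names - the consumer can remember which message IDs it has already handled and skip redelivered copies:

package main

import (
    "fmt"
    "sync"
)

// dedupingProcessor remembers which message IDs it has already handled, so a
// redelivered message (at-least-once delivery) does not trigger the work twice.
// A real implementation would persist this set (e.g. in table storage or a
// database) so it survives restarts.
type dedupingProcessor struct {
    mu   sync.Mutex
    seen map[string]bool
}

func (p *dedupingProcessor) handle(id, body string) {
    p.mu.Lock()
    already := p.seen[id]
    p.seen[id] = true
    p.mu.Unlock()

    if already {
        fmt.Println("skipping duplicate:", id)
        return
    }
    fmt.Println("processing:", id, body)
}

func main() {
    p := &dedupingProcessor{seen: make(map[string]bool)}
    p.handle("msg-1", "payload")
    p.handle("msg-1", "payload") // redelivered copy is ignored
}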
I have a Windows Service (C# 4.0) that picks messages off of a private message queue and for each message sends one or more emails (typically 4 or 5 at most) based on message content.
Message volume is low, so I have avoided complexity and left the service single-threaded, but the emails are important, so I need to ensure that on an SCM Stop command any in-process messages/emails are processed/sent before the Stop completes.
In OnStop I am checking a static "inProcess" flag representing status, and if it is set I am calling ServiceBase.RequestAdditionalTime(120000).
There are 2 problems:
The Stop command completes immediately with some e-mails unsent, despite the request for 2 minutes.
Even if it worked I am only guessing at how long I should wait.
What is the best way to handle this in a single-threaded service?
Thanks for your help!
Greg
To fully answer, we'd need to see the structure of your message processing loop. But one thing I'm thinking is that the ServiceBase.RequestAdditionalTime() method is used to keep the SCM from complaining if a stop command (or pause, continue, start) takes too long; it doesn't mean your service will wait two minutes before stopping.
Thus, the only thing it truly does is keep the SCM from erroring out on a stop request, if you have a slow stop process.
See MSDN here: RequestAdditionalTime() method
What I'm wondering is whether, when OnStop() is called, you set some completion flag and the processing loop immediately exits when it sees that flag?
If you could post your code it would help me refine this answer, but from the question I wonder if you are expecting the call to wait for 2 minutes to let the service process more, while at the same time you are setting something that tells the processing loop to stop. If this is not the case I can refine the answer further.
As for how long you should wait, that depends on how critical the emails are and how many are likely to be in the queue, and if they are persisted anywhere so that restarting the service would pick up where they left off.