I have a multithreaded program that processes images. I want to add a queue so that an external program can insert requests and the image-processing threads can dequeue them. What is the best approach when the queue is full and new requests arrive? Should I drop new, old, or random requests?
That is entirely up to you and how you want your program to behave. Consider what each choice might lead to and decide which is the lesser evil.
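To make the trade-off concrete, here is a minimal sketch of a bounded queue with the three overflow policies from the question. The class name `BoundedRequestQueue` and the policy names are made up for illustration, and the sketch is not thread-safe; a real image processor would need a lock or a condition variable around `put` and `get`.

```python
import random
from collections import deque


class BoundedRequestQueue:
    """Fixed-capacity FIFO queue with a configurable overflow policy.

    Hypothetical helper for illustration only; not thread-safe.
    """

    def __init__(self, capacity, policy="drop_new", rng=None):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        if policy not in ("drop_new", "drop_old", "drop_random"):
            raise ValueError(f"unknown policy: {policy}")
        self.capacity = capacity
        self.policy = policy
        self.items = deque()
        self.dropped = []  # kept only so the effect of each policy is visible
        self.rng = rng or random.Random()

    def put(self, request):
        """Insert a request; return True if it ended up in the queue."""
        if len(self.items) < self.capacity:
            self.items.append(request)
            return True
        if self.policy == "drop_new":
            # Queue is full: reject the incoming request.
            self.dropped.append(request)
            return False
        if self.policy == "drop_old":
            # Queue is full: evict the oldest queued request.
            self.dropped.append(self.items.popleft())
        else:
            # Queue is full: evict a randomly chosen queued request.
            index = self.rng.randrange(len(self.items))
            self.dropped.append(self.items[index])
            del self.items[index]
        self.items.append(request)
        return True

    def get(self):
        """Remove and return the oldest queued request, or None if empty."""
        return self.items.popleft() if self.items else None


def demo(policy):
    """Push five requests into a queue of capacity three."""
    q = BoundedRequestQueue(capacity=3, policy=policy, rng=random.Random(0))
    for i in range(5):
        q.put(f"img-{i}")
    return list(q.items), q.dropped
```

With `drop_new` the two latest requests are lost, with `drop_old` the two earliest are lost. Which loss hurts less depends on whether stale or fresh image requests matter more to the caller.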
I recommend using a message broker for this approach.
You can send the images as messages to this broker which will handle the routing to the target program, the complete queuing system as well as the "full queue" problem.
I have personally had good experiences with RabbitMQ, although it might be overkill for your particular purpose. You may also want to look at ZeroMQ, which is more lightweight and might suit you better.
To do things right, determine what you really need and check whether those message brokers are really useful in your current situation - in my opinion they are, but it depends on your exact requirements and implementation.
If you're interested, look at AMQP, the Advanced Message Queuing Protocol, itself - it's the basis for many of these message brokers (RabbitMQ among them) and quite interesting.
I'm trying to design a robust architecture, but I'm having trouble solving message delivery.
Let me try to explain.
The API would be clustered on ECS receiving a bunch of requests.
The Workers would be clustered too, subscribing to the same channels. (That's the problem; with only one worker there wouldn't be any issue.)
How do I deal with multiple workers while avoiding duplicated messages?
What would be a good, simple approach that keeps many workers occupied?
Thank you.
This sounds like a very fundamental problem for a message broker: one channel with multiple workers subscribed to it, all of them receiving the same message. It wouldn't really be useful to process the same message multiple times.
This problem has been addressed in most message brokers (I believe). For example, when you consume a message from an Amazon SQS queue, that message is not visible to other consumers for a particular timeframe (the visibility timeout).
Once the worker has processed the message, it has to delete it from the queue. Otherwise, when the timeout expires, other workers will see the message and process it.
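The receive, hide, and delete cycle can be illustrated with a small in-memory toy. This is not the SQS API; `VisibilityQueue` and its methods are invented for this sketch, and time is passed in explicitly as `now` so the behaviour is easy to follow.

```python
import itertools


class VisibilityQueue:
    """In-memory toy queue that mimics a visibility timeout (not real SQS)."""

    def __init__(self, visibility_timeout):
        self.visibility_timeout = visibility_timeout
        self._messages = {}  # message id -> [body, invisible_until]
        self._ids = itertools.count(1)

    def send(self, body):
        """Store a message that is visible immediately; return its id."""
        message_id = next(self._ids)
        self._messages[message_id] = [body, 0.0]
        return message_id

    def receive(self, now):
        """Return (id, body) of the first visible message, or None.

        The returned message is hidden from other consumers until
        now + visibility_timeout.
        """
        for message_id, entry in self._messages.items():
            if entry[1] <= now:
                entry[1] = now + self.visibility_timeout
                return message_id, entry[0]
        return None

    def delete(self, message_id):
        """Acknowledge a message so it is never delivered again."""
        self._messages.pop(message_id, None)
```

A worker that calls `receive` and then crashes before `delete` simply lets the timeout lapse, and the next `receive` after that hands the message to another worker.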
SQS in particular has a distributed architecture and sometimes you get duplicate messages in the queue, which are processed by different workers. That's the effect of the at-least-once delivery guarantee that SQS provides.
If your system has to be strict about duplicate messages, then you need to build a de-duplication mechanism around it.
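One simple form of such a de-duplication mechanism is to remember which message IDs have already been handled. The sketch below assumes every message carries a unique ID; `DeduplicatingWorker` is a hypothetical name, and the in-process `set` stands in for what would have to be a shared store (a database table or a cache) once several worker machines are involved.

```python
import threading


class DeduplicatingWorker:
    """Runs a handler at most once per message id within this process."""

    def __init__(self, handler):
        self._handler = handler
        self._seen = set()  # assumption: a shared store in a real cluster
        self._lock = threading.Lock()

    def process(self, message_id, body):
        """Handle the message unless its id was seen before.

        Returns True if the handler ran, False for a duplicate.
        """
        with self._lock:
            if message_id in self._seen:
                return False
            self._seen.add(message_id)
        # The id is recorded before the handler runs, so a handler that
        # fails would need its id removed again to allow a retry.
        self._handler(body)
        return True
```

An alternative design is to make the handler itself idempotent, so that processing a duplicate is harmless and no bookkeeping is needed.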
The keywords you are looking for are "exactly once guarantee in a distributed system". With that you can do some research on your own, but here are some pointers.
You could use an event queue system that supports "exactly once" guarantees, for example Apache Pulsar (see this link) or Kafka, or you can use their approach as inspiration for your own implementation (which may be somewhat hard to do).
For your own implementation, you could write a special consumer that is the only consumer, acts as a distributor for worker tasks, and is responsible for guaranteeing "exactly once". It would be a tradeoff and could prove to be a bottleneck, depending on your scalability requirements. This article explains why it is a difficult problem in distributed systems.
I have a Node.js app with a small set of users that is currently architected with a single web process. I'm thinking about adding an after save trigger that will get called when a record is added to one of my tables. When that after save trigger is executed, I want to perform a large number of IO operations to external APIs. The number of IO operations depends on the number of elements in an array column on the record. Thus, I could be performing a large number of asynchronous operations after each record is saved in this particular table.
I thought about moving this work to a background job as suggested in Worker Dynos, Background Jobs and Queueing. The article gives as a rule of thumb that tasks that take longer than 500 ms should be moved to a background job. However, after working through the example using RabbitMQ (Asynchronous Web-Worker Model Using RabbitMQ in Node), I'm not convinced that it's worth the time to set everything up.
So, my questions are:
For an app with a limited amount of concurrent users, is it ok to leave a long-running function in a web process?
If I eventually decide to send this work to a background job it doesn't seem like it would be that hard to change my after save trigger. Am I missing something?
Is there a way to do this that is easier than implementing a message queue?
For an app with a limited amount of concurrent users, is it ok to leave a long-running function in a web process?
this is more a question of preference, than anything.
in general i say no - it's not ok... but that's based on experience in building rabbitmq services that run in heroku workers, and not seeing this as a difficult thing to do.
with a little practice, you may find that this is the simpler solution, as I have (it allows simpler code, and more robust code, as it splits the web away from the background processor - allowing each to run without knowing about each other directly)
If I eventually decide to send this work to a background job it doesn't seem like it would be that hard to change my after save trigger. Am I missing something?
are you missing something? not really
as long as you write your current in-the-web-process code in a well structured and modular fashion, moving it to a background process is not usually a big deal
most of the panic that people get from having to move code into the background, comes from having their code tightly coupled to the HTTP request / response process (i know from personal experience how painful it can be)
Is there a way to do this that is easier than implementing a message queue?
there are many options for distributed computing and background processing. i personally like RabbitMQ and the messaging patterns that it uses.
i would suggest giving it a try and seeing if it's something that can work well for you.
other options include redis with pub/sub libraries on top of it, using direct HTTP API calls to another web server, or just using a timer in your background process to check database tables on a given frequency and having the code run based on the data it finds.
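The last option, a timer that polls a database table, is probably the least setup of the lot. Here is a rough, language-agnostic sketch of the idea written in Python with `sqlite3`; the `jobs` table, its columns, and the function names are all assumptions made for illustration, and the same shape carries over to a Node.js worker.

```python
import sqlite3
import time


def create_schema(conn):
    """Create the hypothetical jobs table used as a poor man's queue."""
    conn.execute(
        "CREATE TABLE jobs ("
        " id INTEGER PRIMARY KEY,"
        " payload TEXT NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'pending')"
    )


def enqueue(conn, payload):
    """Called from the web process, e.g. inside the after-save trigger."""
    conn.execute("INSERT INTO jobs (payload) VALUES (?)", (payload,))
    conn.commit()


def poll_once(conn, handler):
    """Handle every pending job once; return how many were handled."""
    rows = conn.execute(
        "SELECT id, payload FROM jobs WHERE status = 'pending' ORDER BY id"
    ).fetchall()
    for job_id, payload in rows:
        handler(payload)
        conn.execute("UPDATE jobs SET status = 'done' WHERE id = ?", (job_id,))
        conn.commit()
    return len(rows)


def run_forever(conn, handler, interval_seconds=5.0):
    """Background process loop: check the table on a fixed frequency."""
    while True:
        poll_once(conn, handler)
        time.sleep(interval_seconds)
```

This sketch assumes a single background process. With several pollers, rows would have to be claimed atomically before handling, which is roughly the point where a real message queue starts to pay for itself.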
p.s. you may find my RabbitMQ For Developers course of interest, if you are wanting to dig deeper into RMQ w/ node: http://rabbitmq4devs.com
I have a Java application that uses an Oracle Queue to store messages for later processing by multiple consumer threads. The messages in this queue can be related to each other and must therefore be processed in a specific order based on the business logic of my application. Basically, I want the dequeueing of one message A to be held back as long as another message B in the queue has not been completely processed. The only tools Oracle AQ offers here, as far as I can see, are the Delay and Priority parameters. These, however, cannot be used to achieve the scenario outlined above, since there are situations where two related messages can still be dequeued and processed at the same time. Are there any tools that can help establish an advanced processing order of messages?
I came to the conclusion that it is not a good idea to order these messages using the queue, because it would need a custom and very specialized dequeue strategy, which smells very bad to me, both in complexity and most likely in performance. It also tries to fix communication protocol issues using the queue, which are application specific and should therefore be handled in the application itself. Instead, the application / communication protocol should be tolerant enough to handle ordering issues.
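One way an application can tolerate out-of-order delivery is to park messages that arrive too early and release them once their predecessors have been handled. The sketch below assumes each message carries a group key and a sequence number starting at 0; neither is something Oracle AQ provides, and `ReorderBuffer` is a hypothetical name. It is written in Python for brevity, but the idea maps directly to Java.

```python
import threading


class ReorderBuffer:
    """Delivers messages of each group to the handler in sequence order."""

    def __init__(self, handler):
        self._handler = handler
        self._next_seq = {}  # group -> next sequence number expected
        self._parked = {}    # group -> {seq: payload} that arrived early
        self._lock = threading.Lock()

    def accept(self, group, seq, payload):
        """Take a dequeued message; run the handler for everything ready."""
        # Holding the lock while handling keeps the sketch simple, at the
        # cost of processing one message at a time across all groups.
        with self._lock:
            expected = self._next_seq.get(group, 0)
            if seq < expected:
                return  # already handled; ignore the duplicate
            parked = self._parked.setdefault(group, {})
            parked[seq] = payload
            while expected in parked:
                self._handler(group, parked.pop(expected))
                expected += 1
            self._next_seq[group] = expected
```

Consumer threads can then dequeue in any order and pass everything through `accept`; related messages reach the business logic in order regardless of how the queue delivered them.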
I'm planning a multithreaded server (SCGI to be precise). Now, I know that the traditional approach using one thread per connection is not very scalable. I also don't want to use something fancy like libevent, as this is a hobby project and I prefer not to have lots of dependencies in my codebase.
The approach I'm thinking of is to use a thread pool and let one thread listen to the network and queue up any incoming requests. The threads managed by the pool then dequeue the requests, receive the data and respond accordingly.
This way, I wouldn't have the overhead of constant thread creation while still being able to serve many requests in parallel.
Is there some fundamental problem in this architecture design that I'm not aware of or is this a not-ideal-but-still-ok solution?
Thanks!
Looks good. Depending on how much load this server is expected to see, this might be overkill.
One thread per connection is good enough until you start handling dozens if not hundreds of requests in parallel. The advantage of one thread per connection is its simplicity, and it might not be worthwhile to give that up.
On the other hand, if you are looking for something that needs to handle tons of traffic (either external like a web proxy or internal like memcache), you probably should just use libevent. AFAIK all the big boys are using it or something very similar (memcache, haproxy and so on).
Finally, if you are doing this just for fun just use whatever you want :) It's possible to get good performance with all those archs.
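For reference, the thread-pool design from the question fits in a few lines. This is only a sketch of the structure, written in Python: the `serve` function and its parameters are invented for illustration, and a plain loop over `requests` stands in for the listener thread that would call `accept()` on a socket in a real SCGI server.

```python
import queue
import threading

_STOP = object()  # sentinel telling a worker to exit


def serve(requests, handler, num_workers=4, max_pending=16):
    """Feed requests through a bounded queue to a fixed pool of workers.

    Returns the handler results in the order the requests were given.
    """
    pending = queue.Queue(maxsize=max_pending)
    results = {}
    results_lock = threading.Lock()

    def worker():
        while True:
            item = pending.get()
            if item is _STOP:
                break
            request_id, payload = item
            response = handler(payload)
            with results_lock:
                results[request_id] = response

    workers = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in workers:
        thread.start()

    # Stand-in for the listener thread: a real server would accept a
    # connection here and enqueue it for the pool.
    for request_id, payload in enumerate(requests):
        pending.put((request_id, payload))  # blocks while the queue is full

    # One sentinel per worker, then wait for the pool to drain.
    for _ in workers:
        pending.put(_STOP)
    for thread in workers:
        thread.join()
    return [results[i] for i in range(len(results))]
```

Because the queue is bounded, a slow pool pushes back on the listener instead of letting pending requests pile up without limit.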
I tend to use the following as my standard threading model, but maybe it isn't such a great model. What other suggestions do people have or do they think this is set up well? This is not for a high performance internet server, though performance is sometimes pretty critical and in those cases I use asynchronous networking methods and reuse buffers, but it is the same model.
There is a gui thread to run the gui.
There is a backend thread that handles anything that is computationally intensive (basically anything the gui can hand off that isn't pretty quick to run) and also is in charge of parsing and acting on incoming messages or gui actions.
There are one or more networking threads that take care of breaking an outgoing send into pieces if necessary, receiving packets from various sockets and reassembling them into messages.
There is a static class which serves as an intermediary between the networking and backend threads. It acts as a post office. Messages that need to go out are posted to it by backend threads, and networking threads check the "outbox" of this class to find messages to send. Networking threads also post any incoming messages to a static "inbox" this class has (regardless of the socket they arrive from, though that information is posted with the incoming message), which the backend thread checks to find messages from other machines it should act on.
The gui / backend threading interface tends to be more ad hoc and should probably have its own post-office-like class or some alternative intermediary.
Any comments/suggestions on this threading setup?
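The "post office" intermediary described above can be sketched as a pair of thread-safe queues. This is a rough illustration in Python rather than the poster's actual code: the class and method names are invented, and it uses an ordinary instance instead of a static class so that more than one can exist.

```python
import queue


class PostOffice:
    """Thread-safe mailbox pair shared by backend and networking threads."""

    def __init__(self):
        self._outbox = queue.Queue()  # backend -> network
        self._inbox = queue.Queue()   # network -> backend

    # Called by backend threads.
    def post_outgoing(self, destination, message):
        """Queue a message for a networking thread to send."""
        self._outbox.put((destination, message))

    def next_incoming(self, timeout=None):
        """Return (source, message), or None if the timeout elapses.

        With timeout=None this blocks until a message arrives.
        """
        try:
            return self._inbox.get(timeout=timeout)
        except queue.Empty:
            return None

    # Called by networking threads.
    def post_incoming(self, source, message):
        """Queue a reassembled message together with where it came from."""
        self._inbox.put((source, message))

    def next_outgoing(self, timeout=None):
        """Return (destination, message), or None if the timeout elapses."""
        try:
            return self._outbox.get(timeout=timeout)
        except queue.Empty:
            return None
```

A second instance of the same class could serve as the intermediary between the gui and backend threads.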
My primary concern is that you don't really want to lock yourself into the idea that there can only be one back-end thread. My normal model is to use the MVC at first, make sure all the data structures I use aren't inherently unsafe for a threaded environment, avoid singletons, and then profile like crazy, splitting things out as I go while trying to minimize the number of condition variables I'm leveraging. For long asynchronous tasks, I prefer to spawn a new process, particularly if it's something that might want to let the OS give it a differing priority.
This architecture sounds like the classic Model-View-Controller architecture, which is usually considered good.