Celery: ensure tasks are executed sequentially - multithreading

In a project I'm currently working on, I issue Celery tasks every once in a while. These tasks are for specific clients, so there are tasks e.g. for clientA, for clientB, and for clientC. There are some additional conditions:
Tasks for the same client may never be executed in parallel.
Tasks for the same client must be executed in sequence, i.e. in message-queue order.
The Celery cookbook (see also this article) shows a locking mechanism that ensures only one instance of a task is executed at a time. This mechanism can easily be adapted to lock per client, so that only one task for a given client runs at a time, which satisfies the first condition.
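Roughly, that per-client adaptation might look like this (a sketch assuming current Celery and a memcached-backed Django cache, as in the cookbook; client_task, handle() and the lock timeout are illustrative, not part of any API):

from celery import shared_task
from django.core.cache import cache

LOCK_EXPIRE = 60 * 10  # safety net: the lock expires after 10 minutes

@shared_task(bind=True)
def client_task(self, client_id, payload):
    lock_id = 'client-lock-%s' % client_id
    # cache.add is atomic: it succeeds only if the key does not exist yet
    if cache.add(lock_id, 'locked', LOCK_EXPIRE):
        try:
            handle(client_id, payload)  # hypothetical per-client work
        finally:
            cache.delete(lock_id)
    else:
        # another task for this client holds the lock; retry later
        # (note: as discussed below, retrying does not preserve order)
        raise self.retry(countdown=5)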
The second condition is harder to ensure. Since tasks are generated from different processes, I can't use task chaining. Perhaps I could modify the locking mechanism to retry tasks while they are waiting for the lock, but this still cannot guarantee order (due to retry timeouts, but also due to a race condition in acquiring the lock).
For now, I have limited concurrency to 1 to ensure order, but some of the tasks take a long time and this scales quite badly.

Related

Gearman callback with nested jobs

I have a Gearman job that runs and itself executes more jobs, which in turn may execute more jobs. I would like some kind of callback when all nested jobs have completed. I can do this easily, but my implementations would tie up workers (spinning until children are complete), which I do not want to do.
Is there a workaround? There is no concept of "groups" in Gearman AFAIK, so I can't add jobs to a group and have something fire once that group has completed.
As you say, there's nothing built in to Gearman to handle this. If you don't want to tie up a worker (letting that worker add tasks and track their completion for you), you'll have to do out-of-band status tracking.
One way to do this is to keep per-group counters in memcached: increment the total-task counter whenever you add a new job to a group, and increment the finished-task counter whenever a job in that group completes. You can then poll memcached to see the current state of execution (tasks finished vs. tasks total).
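A rough sketch of those counters, assuming the python-memcached client (the key scheme and group_id are illustrative; Gearman itself provides none of this):

import memcache

mc = memcache.Client(['127.0.0.1:11211'])

def job_added(group_id):
    # create the counters on first use, then count one more job in flight
    mc.add('group-%s-total' % group_id, '0')
    mc.add('group-%s-done' % group_id, '0')
    mc.incr('group-%s-total' % group_id)

def job_finished(group_id):
    mc.incr('group-%s-done' % group_id)

def group_complete(group_id):
    # poll: the group is done once every added job has finished
    total = int(mc.get('group-%s-total' % group_id) or 0)
    done = int(mc.get('group-%s-done' % group_id) or 0)
    return total > 0 and done == total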

Can GCD on iOS handle hundreds of dispatched blocks?

I would like to utilize GCD for, say, a hundred objects that all need to download some data from the server. If I were to loop over these objects, and call something like:
dispatch_queue_t q = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_HIGH, 0);
dispatch_async(q, ^{
    // Download data;
});
Would those blocks get intelligently queued and efficiently executed, or am I going to run into memory issues, performance issues, or even race conditions?
My two cents: the blocks will be queued, as the name suggests, and the downloads will start one at a time; as long as I clean up properly if the application terminates before all the downloads are complete, there shouldn't be problems.
However, another bonus question:
Would I benefit more if I were to create, say, 3-5 queues and download multiple files at any one time by distributing the downloads amongst the queues?
After implementing this functionality, it seems that the accepted answer is not complete, or I missed the point. Dispatching blocks on a global queue, which is a concurrent queue, can lead to issues, since there is a maximum number of threads GCD will spawn (~64). This is only a problem for concurrent queues, since they may need a thread for each in-flight block. If, however, you create your own queue, it will be a serial queue that executes the blocks one after the other, even when dispatch_async is called. This way you can be certain that your queue will use only one thread, queue your operations, and never hit the thread limit.
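To illustrate the serial-queue idea outside GCD, here is a rough Python analogy (illustrative only, not GCD itself): a single worker thread draining a queue gives the same one-block-at-a-time behaviour as a serial dispatch queue:

import queue
import threading

work = queue.Queue()  # plays the role of a serial dispatch queue

def worker():
    while True:
        block = work.get()  # sleeps until something is enqueued
        block()             # blocks run strictly one after another
        work.task_done()

threading.Thread(target=worker, daemon=True).start()

# the equivalent of dispatch_async: enqueue and return immediately
work.put(lambda: print("download 1"))
work.put(lambda: print("download 2"))
work.join()  # wait for the queued blocks to finish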
Addressing your first batch of questions, i.e. "Would those blocks get intelligently queued and efficiently executed, or am I going to run into memory issues, performance issues, or even race conditions?"
My answers:
Would those blocks get intelligently queued and efficiently executed: yes, they will, without any doubt.
Am I going to run into memory issues, performance issues: this is situation-dependent. Since you are dispatching asynchronously, there is no guarantee that the second download will start only after the first one completes. If you are downloading images or video, you may run out of memory because of accumulating CFData buffers (which can be handled).
Or even race conditions: you will not get caught in a race condition if you know how and when to switch threads. For example, if your download class's delegate methods need to be called on the main thread, you will need to start the connection on the main thread, as with NSURLConnection. If you are dealing with UI elements after the download, you will likewise need to switch back to the main thread. Otherwise no race condition or deadlock will occur.
Addressing your second question, "Would I benefit more if I were to create, say, 3-5 queues and download multiple files at any one time by distributing the downloads amongst the queues?"
If I were in your place, I would go with a single queue for hundreds of objects. I have been in this situation; in my case I had to download thousands of files. I would download a single file at a time, do the cleanup, and then move on to the next. When you have hundreds of files to download, even 0.1 MB of extra allocation per file will cause a performance issue.
From Concurrent Programming in Mac OS X and iOS (O'Reilly): [book excerpt not preserved]

Threadpool multi-queue job dispatch algorithm

I'm curious to know if there is a widely accepted solution for managing thread resources in a threadpool given the following scenario/constraints:
Incoming jobs are all of the same nature and could be processed by any thread in the pool.
Incoming jobs will be 'bucketed' into different queues based on some attribute of the incoming job, such that all jobs going to the same bucket/queue MUST be processed serially.
Some buckets will be less busy than others at different points during the lifetime of the program.
My question is on the theory behind a threadpool's implementation. What algorithm could be used to efficiently allocate available threads to incoming jobs across all buckets?
Edit: Another design goal would be to eliminate as much latency as possible between a job being enqueued and it being picked up for processing, assuming there are available idle threads.
Edit2: In the case I'm thinking of there are a relatively large number of queues (50-100) which have unpredictable levels of activity, but probably only 25% of them will be active at any given time.
The first (and most costly) solution I can think of is to simply have 1 thread assigned to each queue. While this will ensure incoming requests are picked up immediately, it is obviously inefficient.
The second solution is to combine the queues together based on expected levels of activity so that the number of queues is in line with the number of threads in the pool, allowing one thread to be assigned to each queue. The problem here is that incoming jobs, which otherwise could be processed in parallel, will be forced to wait on each other.
The third solution is to create the maximum number of queues, one for each set of jobs that must be processed serially, but only allocate threads based on the number of queues we expect to be busy at any given time (which could also be adjusted by the pool at runtime). So this is where my question comes in: Given that we have more queues than threads, how does the pool go about allocating idle threads to incoming jobs in the most efficient way possible?
I would like to know if there is a widely accepted approach. Or if there are different approaches - who makes use of which one? What are the advantages/disadvantages, etc?
Edit3: This might be best expressed in pseudocode.
You should probably eliminate requirement 2 from your specification. All you really need to comply with is that threads take up buckets and process the queues inside the buckets in order. It makes no sense to process a serialized queue with another threadpool, or to do some serialization of tasks in parallel. Thus your spec simply becomes that the threads iterate over the FIFO in the buckets, and it's up to the pool manager to insert properly constructed buckets. So your bucket would be:
struct task_bucket
{
    void *ctx;      // context-relevant data shared by the bucket's jobs
    fifo_t *queue;  // FIFO of jobs to be processed in order
};
Then it's up to you to make the threadpool smart enough to know what to do on each iteration of the queue. For example, the ctx can be a function pointer and the queue can contain data for that function, so the worker thread simply calls the function on each iteration with the provided data.
Reflecting the comments:
If the size of the bucket list is known beforehand and isn't likely to change during the lifetime of the program, you'd need to figure out whether that matters to you. You will need some way for the threads to select a bucket to work on. The easiest way is to have a FIFO queue that is filled by the manager and emptied by the threads. Classic reader/writer.
Another possibility is a heap. The worker removes the highest priority from the heap and processes the bucket queue. Both removal by the workers and insertion by the manager reorders the heap so that the root node is the highest priority.
Both these strategies assume that the workers throw away the buckets and the manager makes new ones.
If keeping the buckets is important, you run the risk of workers only attending to the most recently modified task, so the manager will either need to reorder the bucket list or modify the priority of each bucket while the worker iterates looking for the highest priority. It is important that the memory behind ctx remains valid while threads are working, or the threads will have to copy it as well. Workers can simply take the queue locally and set queue to NULL in the bucket.
ADDED: I now tend to agree that you might start simple and just keep a separate thread for each bucket, and only look for something different if this simple solution turns out to have problems. And a better solution might depend on exactly what problems the simple one causes.
In any case, I leave my initial answer below, appended with an afterthought.
You can make a special global queue of "job is available in bucket X" signals.
All idle workers would wait on this queue, and when a signal is put into the queue one thread will take it and proceed to the corresponding bucket to process jobs there until the bucket becomes empty.
When an incoming job is submitted into an in-order bucket, check whether a worker thread is already assigned to this bucket. If one is assigned, the new job will eventually be processed by that worker thread, so no signal should be sent. If no worker is assigned, check whether the bucket is empty or not. If empty, place a signal into the global signal queue that a new job has arrived in this bucket; if not empty, such a signal should have been sent already and a worker thread should soon arrive, so do nothing.
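A sketch of this signal-queue scheme, written in Python for brevity (the bucket bookkeeping and the pool size are illustrative assumptions, not a definitive implementation):

import queue
import threading
from collections import defaultdict, deque

signals = queue.Queue()        # global queue of "job available in bucket X" signals
buckets = defaultdict(deque)   # per-bucket FIFO of pending jobs
assigned = set()               # buckets currently owned by a worker
state = threading.Lock()       # guards buckets and assigned

def submit(bucket_id, job):
    with state:
        was_empty = not buckets[bucket_id]
        buckets[bucket_id].append(job)
        # signal only if no worker owns this bucket and it was empty;
        # otherwise either a signal is already pending or the owning
        # worker will pick the job up on its next iteration
        if bucket_id not in assigned and was_empty:
            signals.put(bucket_id)

def worker():
    while True:
        bucket_id = signals.get()            # wait for a bucket with work
        with state:
            assigned.add(bucket_id)
        while True:
            with state:
                if not buckets[bucket_id]:
                    assigned.discard(bucket_id)  # release before going idle
                    break
                job = buckets[bucket_id].popleft()
            job()                            # jobs in a bucket run in order

for _ in range(4):                           # pool size chosen arbitrarily
    threading.Thread(target=worker, daemon=True).start()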
ADDED: It occurred to me that the idea above can cause starvation for some jobs if the number of threads is less than the number of "active" buckets and there is a never-ending flow of incoming tasks. If all threads are already busy and a new job arrives in a bucket that is not yet being served, it may take a long time before a thread is freed to work on that new job. So there is a need to check whether there are idle workers, and if not, create a new one... which adds more complexity.
Keep it simple: I'd use 1 thread per queue. Simplicity is worth a lot, and threads are quite cheap. 100 threads won't be an issue on most OSes.
By using a thread per queue, you also get a real scheduler. If a thread blocks (depending on what you're doing), another thread can be queued. You won't get deadlock until every single one blocks. The same cannot be said if you use fewer threads: if the queues that the threads happen to be servicing block, then even if other queues are "runnable", and even if those other queues might unblock the blocked threads, you'll have deadlock.
Now, in particular scenarios, using a threadpool may be worth it. But then you're talking about optimizing a particular system, and the details matter. How expensive are threads? How good is the scheduler? What about blocking? How long are the queues, how frequently updated, etc.
So in general, with just the information that you have around 100 queues, I'd go for a thread per queue. Yes, there's some overhead: all solutions have that. A threadpool introduces synchronization issues and overhead, while the overhead of a limited number of threads is fairly minor: you're mostly talking about around 100 MB of address space - not necessarily memory. If you know most queues will be idle, you could further implement an optimization to stop threads on empty queues and start them when needed (but beware of race conditions and thrashing).

Why are message queues used instead of multithreading?

I have the following query which I need help with. I'm new to message queues and have recently started looking at the Kestrel message queue.
As I understand it, both threads and message queues are used for concurrency in applications, so what is the advantage of using message queues over multithreading?
Please help. Thank you.
Message queues allow you to communicate outside your program.
This allows you to decouple your producer from your consumer. You can spread the work to be done over several processes and machines, and you can manage/upgrade/move around those programs independently of each other.
A message queue also typically consists of one or more brokers that takes care of distributing your messages and making sure the messages are not lost in case something bad happens (e.g. your program crashes, you upgrade one of your programs etc.)
Message queues might also be used internally in a program, in which case it's often just a facility to exchange/queue data from a producer thread to a consumer thread to do async processing.
Actually, one facilitates the other. A message queue is a nice and simple multithreading pattern: when you have a control thread (usually, but not necessarily, an application's main thread) and a pool of (usually looping) worker threads, message queues are the easiest way to coordinate the work of the thread pool.
For example, to start processing a relatively heavy task, you submit a corresponding message into the queue. If you have more messages than you can currently process, the queue grows; if fewer, it shrinks. When the message queue is empty, your threads sleep (usually by blocking on the queue's mutex).
So, there is nothing to compare: message queues are part of multithreading and hence they're used in some more complicated cases of multithreading.
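A minimal sketch of that pattern with Python's standard library (the process() function, the sentinel convention, and the pool size are placeholders):

import queue
import threading

messages = queue.Queue()

def worker():
    while True:
        msg = messages.get()   # workers sleep here while the queue is empty
        if msg is None:        # conventional shutdown sentinel
            break
        process(msg)           # hypothetical heavy task
        messages.task_done()

pool = [threading.Thread(target=worker) for _ in range(4)]
for t in pool:
    t.start()

# the control thread submits work simply by enqueueing messages
messages.put("heavy-task-1")
messages.put("heavy-task-2")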
Creating threads is expensive, and every thread that is simultaneously "live" will add a certain amount of overhead, even if the thread is blocked waiting for something to happen. If program Foo has 1,000 tasks to be performed and doesn't really care in what order they get done, it might be possible to create 1,000 threads and have each thread perform one task, but such an approach would not be terribly efficient. A second alternative would be to have one thread perform all 1,000 tasks in sequence. If there were other processes in the system that could employ any CPU time Foo didn't use, this latter approach would be efficient (and quite possibly optimal), but if there isn't enough work to keep all CPUs busy, CPUs would waste some time sitting idle. In most cases, leaving a CPU idle for a second is just as expensive as spending a second of CPU time (the main exception is when one is trying to minimize electrical energy consumption, since an idling CPU may consume far less power than a busy one).
In most cases, the best strategy is a compromise between those two approaches: have some number of threads (say 10) that start performing the first ten tasks. Each time a thread finishes a task, have it start work on another until all tasks have been completed. Using this approach, the overhead related to threading will be cut by 99%, and the only extra cost will be the queue of tasks that haven't yet been started. Since a queue entry is apt to be much cheaper than a thread (likely less than 1% of the cost, and perhaps less than 0.01%), this can represent a really huge savings.
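That compromise is exactly what a standard thread pool gives you; as a sketch in Python (the pool size of 10 and the 1,000 tasks mirror the example above, and perform() is a placeholder):

from concurrent.futures import ThreadPoolExecutor

def perform(task):
    pass  # do the work for one of the 1,000 tasks (placeholder)

tasks = range(1000)
with ThreadPoolExecutor(max_workers=10) as pool:
    # 10 threads drain the queue of 1,000 tasks; the with-block
    # waits until every task has completed
    pool.map(perform, tasks)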
The one major problem with using a job queue rather than threading is that if some jobs cannot complete until jobs later in the list have run, it's possible for the system to become deadlocked since the later tasks won't run until the earlier tasks have completed. If each task had been given a separate thread, that problem would not occur since the threads associated with the later tasks would eventually manage to complete and thus let the earlier ones proceed. Indeed, the more earlier tasks were blocked, the more CPU time would be available to run the later ones.
It makes more sense to contrast message queues and other concurrency primitives, such as semaphores, mutex, condition variables, etc. They can all be used in the presence of threads, though message-passing is also commonly used in non-threaded contexts, such as inter-process communication, whereas the others tend to be confined to inter-thread communication and synchronisation.
The short answer is that message-passing is easier on the brain. In detail...
Message-passing works by sending stuff from one agent to another. There is generally no need to coordinate access to the data. Once an agent receives a message it can usually assume that it has unqualified access to that data.
The "threading" style works by giving all agent open-slather access to shared data but requiring them to carefully coordinate their access via primitives. If one agent misbehaves, the process becomes corrupted and all hell breaks loose. Message passing tends to confine problems to the misbehaving agent and its cohort, and since agents are generally self-contained and often programmed in a sequential or state-machine style, they tend not to misbehave as often — or as mysteriously — as conventional threaded code.

Check number of idle cores when creating .Net 4.0 Parallel Task

My question might sound a bit naive, but I'm pretty new to multi-threaded programming.
I'm writing an application which processes incoming external data. For each piece of data that arrives, a new task is created in the following way:
System.Threading.Tasks.Task.Factory.StartNew(() => methodToActivate(data));
The items of data arrive very fast (every second, half second, etc.), so many tasks are created. Handling each task can take around a minute. When testing, I saw that the number of threads increases all the time. How can I limit the number of tasks created, so that the number of actual working threads stays stable and efficient? My computer has only two cores.
Thanks!
One of your issues is that the default scheduler sees tasks that last for a minute and assumes they are blocked on other tasks that have yet to be executed. To try to unblock things, it schedules more pending tasks, hence the thread growth. There are a couple of things you can do here:
Make your tasks shorter (probably not an option).
Write a scheduler that deals with this scenario and doesn't add more threads.
Use SetMaxThreads to prevent unbounded thread pool growth.
See the section on Thread Injection here:
http://msdn.microsoft.com/en-us/library/ff963549.aspx
You should look into using the producer/consumer pattern with a BlockingCollection<T> around a ConcurrentQueue<T> where you set the BoundedCapacity to something that makes sense given the characteristics of your workload. You can make your BoundedCapacity configurable and then tweak as you run through some profiling sessions to find the sweet spot.
While it's true that the TPL will take care of queueing up the tasks you create, creating too many tasks does not come without penalties. Also, what's the point in producing more work than you can consume? You want to produce enough work that the consumers will never be starved, but you don't want to get too far ahead of yourself, because that's just wasting resources and potentially stealing those very same resources from your consumers.
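An analogous sketch of that bounded producer/consumer idea in Python (standing in for BlockingCollection<T> with BoundedCapacity; the capacity of 16, method_to_activate and on_data_arrived are placeholders matching the question):

import queue
import threading

incoming = queue.Queue(maxsize=16)   # put() blocks once 16 items are pending

def consumer():
    while True:
        data = incoming.get()        # wait for the next piece of data
        method_to_activate(data)     # the minute-long handler from the question
        incoming.task_done()

for _ in range(2):                   # roughly match the number of cores
    threading.Thread(target=consumer, daemon=True).start()

def on_data_arrived(data):
    incoming.put(data)               # applies backpressure when the queue is full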
You can create a custom TaskScheduler for the Task Parallel library and then schedule tasks on that by passing an instance of it to the TaskFactory constructor.
Here's one example of how to do that: Task Scheduler with a maximum degree of parallelism.
