Tech stack: Celery 5.0.5, Flask, Python, Windows OS (8 CPUs).
To give some background: my use case requires spawning one worker and one queue per country, as determined by the request payload.
I am using celery.control.inspect().active() to get the list of active workers and check whether a worker named {country}_worker exists in it. If not, I spawn a new worker using:
subprocess.Popen(f'celery -A main.celery worker --loglevel=info -Q {queue_name} --logfile=logs\\{queue_name}.log --concurrency=1 -n {worker_name}')
This basically starts a new celery worker and a new queue.
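For reference, here is a minimal sketch of that check-and-spawn pattern, assuming the Celery app is exposed as main.celery; the helper name and the naming scheme are illustrative, not the original code:

    import subprocess
    from main import celery  # assumption: the Flask app exposes its Celery instance here

    def ensure_country_worker(country):
        """Spawn a dedicated worker for `country` unless one is already running."""
        worker_name = f"{country}_worker"
        queue_name = f"{country}_queue"

        # inspect().active() returns {worker_hostname: [active tasks]} or None
        active = celery.control.inspect().active() or {}
        if any(worker_name in hostname for hostname in active):
            return  # a worker for this country already exists

        # One worker process with its own queue, log file and a single concurrency slot
        subprocess.Popen(
            f"celery -A main.celery worker --loglevel=info "
            f"-Q {queue_name} --logfile=logs\\{queue_name}.log "
            f"--concurrency=1 -n {worker_name}"
        )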
My initial understanding was that we can spawn only n workers, where n is cpu_count(). Testing my code showed otherwise: when my 9th worker was spawned, I expected it to wait for one of the previous 8 workers to finish before taking up a task, but it started consuming from its queue as soon as it was spawned, while the other 8 workers were still executing. The same thing happened as I spawned more workers (15 in total).
This brings me to my question: is the --concurrency argument responsible only for parallel execution within a single worker? If I spawn 15 independent workers, does that mean 15 different processes can execute in parallel?
Any help is appreciated in understanding this concept.
Edit: I also noticed that each new task received by the corresponding worker spawns a new python.exe process (per Task Manager), while the previously spawned Python process stays in memory, unused. This does not happen when I start the worker with the "solo" pool instead of "prefork". The problem with using solo: celery.inspect().active() returns nothing while the workers are executing something, and only responds once no tasks are in progress.
If your tasks are I/O-bound, and it seems they are, then perhaps you should change the pool (concurrency) type to Eventlet. Then you can, in theory, set the concurrency even to 1000. However, it is a different execution model, so you need to write your tasks carefully to avoid deadlocks.
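For instance, the spawn command from the question could be changed to use the eventlet pool (the -P/--pool flag selects it; eventlet must be installed separately, and the queue/worker names here are placeholders):

    import subprocess

    # With the eventlet pool, --concurrency is the number of green threads,
    # so it can be far larger than the CPU count for I/O-bound tasks.
    subprocess.Popen(
        "celery -A main.celery worker --pool=eventlet --concurrency=500 "
        "-Q india_queue -n india_worker --loglevel=info"
    )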
If the tasks are CPU-bound, then I suggest setting the concurrency to N-1, where N is the number of cores, unless you want to over-utilise the machine, in which case you can pick a slightly bigger number.
PS. You CAN spawn many worker processes, but since they all run concurrently (as separate processes in this case), each one's CPU utilisation would be low, so it really makes no sense to go above the number of available cores.
Related
I would like to know whether the threads of a process can be made to run on different sets of CPUs in Linux.
For instance, let's say we start a process with 30 threads: the first 15 threads of this process are made to run on cores 0-14 using the taskset program, and the rest of the threads on cores 15-29.
Is the above configuration possible?
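Yes. On Linux every thread is its own scheduling entity with its own TID, and taskset -p -c <cpus> <tid> works on TIDs just as it does on PIDs. A sketch of the same thing from within Python (Linux-only; threading.get_native_id needs Python 3.8+; the sleep stands in for real work):

    import os
    import threading
    import time

    def pinned_worker(cpus):
        # On Linux a thread's native_id is its kernel task ID (TID),
        # so sched_setaffinity can pin this thread individually.
        os.sched_setaffinity(threading.get_native_id(), cpus)
        time.sleep(1)  # placeholder for the thread's actual work

    threads = []
    for i in range(30):
        cpus = set(range(0, 15)) if i < 15 else set(range(15, 30))
        t = threading.Thread(target=pinned_worker, args=(cpus,))
        t.start()
        threads.append(t)

    for t in threads:
        t.join()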
https://code.tutsplus.com/articles/introduction-to-parallel-and-concurrent-programming-in-python--cms-28612
I have studied from this link, and I have a few questions:
Q1: How are the thread pool (concurrent.futures) and threading different here? Why do we see the performance improvement? The threading-with-queue version has 4 threads, each running cooperatively during idle time and picking an item from the queue once it gets the website response. As I see it, the thread pool is doing much the same thing: completing its work and waiting for the manager to assign a task, which is very similar to picking a new item from the queue. I'm not sure how this is different and why I see the performance improvement; it seems I'm wrong in interpreting the pooling here. Could you explain?
Q2: Using multiprocessing, the time taken is greater. If I have a multiprocessor machine that can handle multiple processes at a time, then all my 4 processes should be handled by it simultaneously; that is where real parallelization happens. I also have a question here: in such a case, since 4 processes are running the same function, doesn't the GIL try to stop them from executing the same piece of code? Let's suppose all of them share a common variable that gets updated, like the number of websites checked. So how does the GIL work in these cases of multiprocessing?
Also, are the same processes used again and again, or do they get killed and created anew each time after their job? I think the same processes are used. I also think the performance problem comes from process creation, which is costly compared to the lightweight threads in the concurrent-threading phase. So could you explain in more detail how the GIL works here and how the processes run? Do they run cooperatively (each process waiting for its turn, like the threads of a process do), or do they use the multiple processors to run truly in parallel? My other questions: if I have an 8-core machine, I think I can run 8 threads of the same process simultaneously, in parallel. On that 8-core machine, can I run 2 processes with 4 threads each? Can I run 8 processes on 8 cores? I thought cores were only for the threads of a process, which would mean I can't run 8 processes on 8 cores but only as many processes as there are CPUs in my multiprocessor system. Am I right? So, can I run 2 processes with 4 threads each on my 8-core machine with 2 multiprocessors, each processor having 4 cores?
Python has a rich set of libraries for multitasking with processes and threads. However, there is overlap between the libraries, and the choice depends on how abstractly you view the computational tasks. For example, the concurrent.futures library views threads as asynchronous tasks, while the threading library deals with them as high-level threads. Further, _thread implements a low-level interface for threading, exposing all the synchronization mechanisms.
The GIL (Global Interpreter Lock) is just a synchronization primitive, specifically a mutex, which prevents multiple threads of the same process from executing Python bytecode simultaneously (so that certain objects remain consistent under concurrent operations). This is exactly why Python threads excel at I/O-bound work compared to compute-intensive tasks: the GIL is released during certain blocking calls and inside computationally intensive libraries such as numpy. Note that only the CPython and PyPy implementations of Python are constrained by the GIL mechanism.
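A quick way to see that asymmetry (timings are machine-dependent; the sleep stands in for a blocking I/O call during which the GIL is released):

    import threading
    import time

    def io_task():
        time.sleep(1)  # blocking "I/O": the GIL is released while sleeping

    def cpu_task():
        sum(i * i for i in range(10_000_000))  # pure bytecode: holds the GIL

    for task in (io_task, cpu_task):
        start = time.perf_counter()
        threads = [threading.Thread(target=task) for _ in range(4)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        # io_task finishes in about 1 second (the threads overlap), while
        # cpu_task takes roughly 4x a single run (the threads are serialized).
        print(task.__name__, round(time.perf_counter() - start, 2), "s")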
Now, let's see those questions...
How thread pool (Concurrent) and threading are different here? Why do we see the performance improvement?
Coming to the comparison between Threading and concurrent.futures.ThreadPoolExecutor (aka threading_squirrel vs future_squirrel), I've executed both programs with the same test case. There are two factors that contribute to this "performance improvement":
Network HEAD requests: Remember that network operations need not complete in the same time period every time you execute them... due to the very nature of packet transfer delays...
Order of thread execution: On the website you've linked, the author creates all the threads initially, sets up the queue full of website links and then starts all of them in a list-comprehension loop. With concurrent.futures.ThreadPoolExecutor, each time a task is submitted, a thread is assigned to it if the predefined maximum number of threads/workers has not been reached. I've changed the code to mirror this technique. It seems to give a speedup, as the first thread begins work early on and doesn't need to wait for the queue to be filled up.
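A minimal sketch of that submit-as-you-go pattern with concurrent.futures (the URL list is a placeholder):

    import concurrent.futures
    import urllib.request

    urls = ["https://example.com"] * 20  # placeholder work items

    def check(url):
        # The thread blocks on network I/O here, during which the GIL is
        # released, so the other pool threads keep running.
        with urllib.request.urlopen(url, timeout=5) as resp:
            return url, resp.status

    with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
        # Each submit() hands work to the pool immediately; the first thread
        # starts before the full list of tasks has even been enumerated.
        futures = [pool.submit(check, u) for u in urls]
        for fut in concurrent.futures.as_completed(futures):
            print(fut.result())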
How does GIL work in these cases of multiprocessing?
Remember that the GIL comes into effect only among the threads of a process, not among processes. The GIL locks the whole interpreter while one thread executes bytecode, so the other threads have to wait for their turn. This is the reason multiprocessing uses processes instead of threads: each process has its own interpreter and, consequently, its own GIL.
Are the same processes used again and again, or do they get killed and created anew each time after their job?
The concept of pooling is to reduce the overhead of creating and destroying workers (be they threads or processes) during the computation. However, the processes are kind of "brand new" in the sense that the library effectively asks the OS to perform a fork on a UNIX-based OS or a spawn on an NT-based OS.
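A sketch that shows both points at once: the pool's worker processes are created once and reused, and a counter shared across processes needs its own lock because no single GIL spans them (all names here are illustrative):

    import multiprocessing
    import os

    counter = None

    def init(shared):
        # Runs once in each worker process when the pool starts.
        global counter
        counter = shared

    def check_site(item):
        with counter.get_lock():  # explicit lock: the GIL does not span processes
            counter.value += 1
        return os.getpid()  # report which worker handled this item

    if __name__ == "__main__":
        shared = multiprocessing.Value("i", 0)
        with multiprocessing.Pool(4, initializer=init, initargs=(shared,)) as pool:
            pids = pool.map(check_site, range(20))
        print(set(pids))     # only 4 distinct PIDs: the same processes are reused
        print(shared.value)  # 20: updates were serialized by the Value's lock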
Also, are the processes running co-operatively?
Maybe. They have to run in co-operation if they use shared memory (though they need not be running at the same time). There will definitely be context switches if there are more processes than the OS can allocate to its processors' cores. They can run in parallel if there are no shared-memory updates to make.
If I have the 8 core machine can I run 2 processes with 4 threads each? Can I run 8 processes on 8 cores?
Sure (subject to the GIL, in Python). Each process can be allocated to a processing unit for execution; a processing unit can be a physical or a virtual core of a CPU. As long as the OS scheduler supports it, it's possible, and any reasonable split of processes and threads is possible. If all of them can be allocated at once, that's the best situation; otherwise you will encounter context switches (which are more expensive for processes than for threads).
Hope I've answered all those questions!
Here are a few resources:
MultiCore CPUs, Multithreading and context switching?
Why does multiprocessing use only a single core after I import numpy?
Bonus celery-squirrel resource
For example, let us assume that in my operating system a context switch to another process occurs after 100 μs of execution time. Furthermore, my computer has only one processor, with one thread of execution possible at a time.
If I have Process A, which contains only one thread of execution, and Process B, which has four threads of execution, will this mean that the thread in Process A runs for 100 μs, and Process B also runs for 100 μs but splits that execution time between each of its threads before context switching?
Process A: ran for 100 μs
Thread 1 in Process A execution time: 100 μs
Process B: ran for 100 μs
Thread 1 in Process B execution time: ~25 μs
Thread 2 in Process B execution time: ~25 μs
Thread 3 in Process B execution time: ~25 μs
Thread 4 in Process B execution time: ~25 μs
Would the above be correct?
Moreover, would this be different if I had a quad-core processor? With a quad-core processor, would this potentially mean each thread could run for the full 100 μs, one on each core?
It all really depends on what you are doing in the process / in each thread. If the work can benefit from being split over threads, for example making calls to a web service for processing (since a web service can accept multiple calls at once and execute them separately), then no: the single thread will take longer to finish than the 4 threads, simply because it executes the calls sequentially instead of simultaneously.
On the other hand, if you are executing code that does not benefit from splitting into threads, then on a single core the time to finish all 4 processing threads will be the same as running the work on a single thread.
However, in most cases, splitting the processing into threads should take less time than executing it on a single thread, if you do it right.
The matter of cores doesn't factor in here unless you are attempting to run more threads than one core can handle, in which case the OS will run the extra threads on separate cores.
This link explains a bit more the situation with Cores and Hyper-Threading...
http://www.howtogeek.com/194756/cpu-basics-multiple-cpus-cores-and-hyper-threading-explained/
Thread switches happen on the same interval regardless of process ownership. So if the quantum is 100 μs, then it's always 100 μs, unless of course the thread itself surrenders execution. When that thread will run again is where things get complicated.
I know the Linux scheduler schedules task_structs, and a task_struct represents a thread. Then, if we have two processes, e.g., A containing 100 threads while B is single-threaded, how can the two processes be scheduled fairly, given that each individual thread is scheduled fairly?
In addition, in Linux, would a context switch between threads of the same process be faster than one between threads of different processes? The latter has to switch the process-level state (e.g., the address space), while the former does not.
The point you are missing here is how the scheduler looks at threads or tasks: the Linux kernel scheduler treats each of them as an individual scheduling entity, so they are counted and scheduled separately.
Now let's see what the CFS documentation says. It has a simplistic approach of giving an even slice of CPU time to each runnable task, so if there are 4 runnable processes/threads, they get 25% of the CPU time each. Since this ideal isn't achievable on real hardware, vruntime was introduced to approximate it (the kernel's CFS documentation covers this in more detail).
Now, coming back to your example: if process A creates 100 threads and B creates 1 thread, then, counting each process's main thread as well, the number of runnable tasks becomes 103 (assuming all are in the runnable state), and CFS will share the CPU evenly using the formula 1/103 (CPU / number of running tasks). Note that this means fairness is per task, not per process: A as a whole receives 101/103 ≈ 98% of the CPU, while B gets only 2/103 ≈ 2%. Context switching is the same for all scheduling entities; threads merely share their task's mm_struct, and when they run they have their own set of registers and task state to load. Hope this helps you understand better.
How do I control the number of threads that my program is working on?
I have a program that is now ready for multithreading, but one problem is that it is extremely memory-intensive, so I have to limit the number of threads running to avoid running out of RAM. The main program goes through and creates a whole bunch of handles and associated threads in a suspended state.
I want the program to activate a set number of threads, and when one thread finishes, to automatically resume the next thread in line until all the work has been completed. How do I do this?
Someone once mentioned something about using a thread handler, but I can't find any information about how to write one or exactly how it would work.
If anyone can help, it would be greatly appreciated.
Using Windows and Visual C++.
Note: I don't need to worry about the traditional problems of concurrent access between the threads; each one is completely independent of the others. It's more like batch processing than true multithreading of a program.
Thanks,
-Faken
Don't create threads explicitly. Create a thread pool (see Thread Pools) and queue up your work using QueueUserWorkItem. The thread pool size should be determined by the number of hardware threads available (the number of cores times the hyper-threading ratio) and the CPU-vs-I/O ratio of your work items. By controlling the size of the thread pool, you control the maximum number of concurrent threads.
A suspended thread doesn't use CPU resources, but it still consumes memory, so you really shouldn't be creating more threads than you want to run simultaneously.
It is better to have only as many threads as your maximum number of simultaneous tasks, and to use a queue to pass units of work to the pool of worker threads.
You can give work to the standard pool of threads created by Windows using the Windows Thread Pool API.
Be aware that you will share these threads and the queue used to submit work to them with all of the code in your process. If, for some reason, you don't want to share your worker threads with other code in your process, then you can create a FIFO queue, create as many threads as you want to run simultaneously and have each of them pull work items out of the queue. If the queue is empty they will block until work items are added to the queue.
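Sketched here in Python for brevity (the same structure maps directly onto Win32 threads plus a critical-section-protected queue): a fixed set of worker threads pulling from a FIFO queue, with sentinels for shutdown.

    import queue
    import threading

    NUM_WORKERS = 4  # illustrative: the maximum number of simultaneous tasks
    work_queue = queue.Queue()

    def process(item):
        pass  # stand-in for the memory-intensive batch job

    def worker():
        while True:
            item = work_queue.get()  # blocks until a work item is available
            if item is None:         # sentinel: no more work, exit this thread
                break
            process(item)
            work_queue.task_done()

    # Create exactly as many threads as may run at once; they are reused.
    workers = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
    for w in workers:
        w.start()

    for job in range(100):  # enqueue the whole batch
        work_queue.put(job)

    work_queue.join()         # wait until every item has been processed
    for _ in workers:
        work_queue.put(None)  # one sentinel per worker shuts the pool down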
There is so much to say here.
There are a few ways.
You should only create as many thread handles as you plan on running at the same time, then reuse them when they complete. (Look up thread pool).
This guarantees that you can never have too many running at the same time. This raises the question of finding out when a thread completes. You can have a callback called just before a thread terminates, where a parameter of that callback is the thread handle that just finished. Use Boost.Bind and Boost.Signals for that. When the callback is called, look for another task for that thread handle and restart the thread. That way, all you have to do is add to the "tasks to do" list, and the callback will remove the tasks for you. No polling needed, and no worries about too many threads.