I have a Celery worker process that was started with this command:
celery multi start worker --app=xyz.celery --queue="xyz" \
    --pidfile="/var/run/xyz/%n.pid" \
    --pool=gevent --concurrency=500 --time-limit=1800
I have tasks which are safe for gevent concurrency, but not for OS threads, and I'm seeing an intermittent error that suggests they're being run by multiple OS threads.
Looking at the worker process, it appears to have 7 threads in total:
$ ps -ef | grep "celery worker"
nobody 26577 1 0 Mar06 ? 00:46:43 /usr/bin/python -m celery worker
--time-limit=1800 --concurrency=500 --pool=gevent --app=xyz.celery
--queue=xyz --pidfile=/var/run/xyz/xyz-worker.service.pid --hostname=worker#xyz
$ cat /proc/26577/status
Name: python
...
...
Threads: 7
...
(I can also see via ps -T or via htop that the worker has these 7 threads)
On other servers where I have a similar setup, I have 4 threads instead of 7. I can't figure out what controls this. I don't see anything in the Celery documentation that explains it.
All my servers have 4 CPUs, so it's clearly not that. From everything I've read, it should be just one thread, since I've told it to use gevent for concurrency.
Why does it use more than 1, what determines the number, and how can I control it?
It turns out that these threads are maintained by gevent but they're not used to run user code:
By default, gevent will create threads to handle DNS resolution in a
cooperative fashion (invisibly to the caller). gevent will never run
user code in a separate thread implicitly without being explicitly
instructed to do so by direct usage of a thread pool.
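If you want to convince yourself that user code really stays on one OS thread, a minimal sketch along these lines (my own illustration, not from the post; it assumes gevent is importable in the worker's virtualenv) can help:

# Every greenlet below should report the same OS thread identifier, because
# gevent only runs user code on its internal threads if you explicitly use
# its thread pool. No monkey-patching here, so get_ident() is the real one.
import threading

import gevent

def report(_):
    return threading.get_ident()

greenlets = [gevent.spawn(report, i) for i in range(50)]
gevent.joinall(greenlets)
print({g.value for g in greenlets})  # expected: a set containing exactly one id

As far as I know, the resolver thread pool also grows lazily up to its maximum, so a worker that has done more DNS lookups can simply have more of these threads than an idle one, and switching resolvers (for example via the GEVENT_RESOLVER environment variable) changes the count as well, which could explain the 4 vs. 7 difference between your servers.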
Related
I would like to know whether the threads of a process can be made to run on different sets of CPUs in Linux.
For instance, say we start a process with 30 threads: can the first 15 threads of this process be made to run on cores 0-14 using the taskset program, and the rest of the threads on cores 15-29?
Is the above configuration possible?
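For concreteness, the kind of split being asked about could be sketched like this (a hedged sketch, Linux only, assuming a hypothetical box with at least 30 CPUs): each thread pins itself via sched_setaffinity; equivalently, taskset -p can be pointed at the individual TIDs listed under /proc/<pid>/task.

import os
import threading

def worker(cpus):
    # Passing the native thread id (the TID on Linux) restricts only this
    # thread; passing 0 or the process pid would restrict the whole process.
    os.sched_setaffinity(threading.get_native_id(), cpus)
    # ... the thread's real work would go here ...
    print(threading.get_native_id(), sorted(os.sched_getaffinity(0)))

threads = [
    threading.Thread(target=worker, args=(range(0, 15) if i < 15 else range(15, 30),))
    for i in range(30)
]
for t in threads:
    t.start()
for t in threads:
    t.join()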
Tech stack: Celery 5.0.5, Flask, Python, Windows OS (8 CPUs).
To give some background, my use case requires spawning one worker and one queue per country, as per the request payload.
I am using celery.control.inspect().active() to get the list of active workers and check whether a worker named {country}_worker exists in that list. If not, I spawn a new worker using:
subprocess.Popen('celery -A main.celery worker --loglevel=info -Q {queue_name} --logfile=logs\\{queue_name}.log --concurrency=1 -n {worker_name}')
This basically starts a new celery worker and a new queue.
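For reference, a rough sketch of that check-and-spawn flow (my own reconstruction, not the actual code; the import of the app from main.py and the substring match on the hostname are assumptions):

import subprocess

from main import celery as app  # assumption: the Celery app object lives in main.py

def ensure_worker(country):
    worker_name = f"{country}_worker"
    queue_name = country
    # inspect().active() returns {hostname: [running tasks]}, or None if no
    # worker replied within the timeout.
    active = app.control.inspect().active() or {}
    if any(worker_name in hostname for hostname in active):
        return  # a worker for this country is already running
    subprocess.Popen(
        f"celery -A main.celery worker --loglevel=info "
        f"-Q {queue_name} --logfile=logs\\{queue_name}.log "
        f"--concurrency=1 -n {worker_name}"
    )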
My initial understanding was that only n workers can be spawned, where n is cpu_count(). With that understanding, while testing my code I expected that when the 9th worker was spawned it would wait for one of the previous 8 workers to finish execution before taking up a task, but as soon as it was spawned it started consuming from its queue while the other 8 workers were still executing; the same happened when I spawned more workers (15 in total).
This brings me to my question: is the --concurrency argument of a Celery worker only responsible for parallel execution within that worker? If I spawn 15 independent workers, does that mean 15 different processes can execute in parallel?
Any help is appreciated in understanding this concept.
Edit: I also noticed that each new task received by the corresponding worker spawns a new python.exe process (according to Task Manager), and the previously spawned python process remains in memory, unused. This does not happen when I start the worker with the "solo" pool rather than "prefork". The problem with using solo: celery.inspect().active() does not return anything while the workers are executing something, and only responds when no tasks are in progress.
If your tasks are I/O-bound, and it seems they are, then perhaps you should change the pool type to Eventlet. You could then, in theory, set the concurrency even to 1000. However, it is a different execution model, so you need to write your tasks carefully to avoid deadlocks.
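For example, reusing the command from the question, switching the pool would look roughly like this (eventlet has to be pip-installed first; -P/--pool picks the pool implementation and -c/--concurrency sets the number of green threads; the concurrency value is only illustrative):

celery -A main.celery worker -P eventlet -c 1000 -Q {queue_name} --logfile=logs\{queue_name}.log -n {worker_name}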
If the tasks are CPU-bound, then I suggest setting the concurrency to N-1, where N is the number of cores, unless you want to over-utilise the machine, in which case you can pick a slightly bigger number.
PS. You CAN spawn many worker processes, but since they all run concurrently (as separate processes in this case), their individual CPU utilisation will be low, so it really makes no sense to go above the number of available cores.
I'm trying to understand how the parent and various child OS threads work in a Haskell program compiled with GHC -threaded.
Using
module Main where

import Control.Concurrent

main = do
  threadDelay 9999999999
Compiling with -threaded on ghc 8.6.5, and running with +RTS -N3 for instance, I can see
$ pstree -p 6615
hello(6615)─┬─{ghc_ticker}(6618)
├─{hello:w}(6616)
├─{hello:w}(6617)
├─{hello:w}(6619)
├─{hello:w}(6620)
├─{hello:w}(6621)
├─{hello:w}(6622)
└─{hello:w}(6623)
It looks like I get N*2 + 1 of these "hello:w" threads as I vary +RTS -N.
What are these "hello:w" threads, and why are there apparently two per HEC + 1?
And what does ghc_ticker do?
I also noticed on a large real service I'm testing with +RTS -N4 I'm getting e.g. 14 of these "my-service:w" threads, and when under load these process IDs seem to churn (half of them stay alive until I kill the service).
Why 14, and why do half of them get spawned and then die?
I'd also accept an answer that helped guide me to instrumenting my code to figure out these latter two questions.
The ghc_ticker is spawned at startup; it runs this function. Its purpose is described as:
The interval timer is used for profiling and for context switching in
the threaded build.
The other *:w threads are workers; they are created whenever there is more work to do (i.e. a Task) but there are no spare workers left, see here.
On startup GHC creates one worker per capability; after that they are created as needed and reused when possible. It's hard to say why you have 14 workers in the -N4 case. I can only guess that they are serving the IO manager threads: see here. Let's not forget about the FFI either: an FFI call may block a worker. You can try to put a breakpoint in createOSThread to see why workers are created.
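If you'd rather not attach a debugger, the RTS can also report this itself. A rough sketch, assuming you can rebuild the binary and have ThreadScope installed (exact output varies by GHC version):

$ ghc -threaded -rtsopts -debug hello.hs
$ ./hello +RTS -N3 -Ds        # -Ds traces scheduler events, including worker creation (needs the debug RTS)

$ ghc -threaded -rtsopts -eventlog hello.hs
$ ./hello +RTS -N3 -l         # writes hello.eventlog
$ threadscope hello.eventlog  # visual timeline of capabilities and OS threads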
You can read more about the scheduler here.
ADDED:
Hmm, I think I can explain the N*2+1 workers: one worker per capability (N in total) is created at startup; N more are the IO manager event loops, one per capability; plus one IO manager timer thread. Though I'm not sure why the first N workers (created at startup) were not reused for the IO manager threads.
I learned from other articles that Node.js is single-threaded, but when I run node on my server, I find it has 4 additional node threads and 4 V8 worker threads.
I want to know why. What is the responsibility of all these threads?
Can anyone provide some useful documents or an explanation?
My actions:
Start my node program.
ps aux | grep xxx to find the pid.
Use top -Hp [the pid] to show the threads of the node process.
(screenshot of the top -Hp output)
I infer from the Node source code that a Node.js process has two types of thread pool:
libuv thread pool. The thread name is the same as the node process. It executes tasks related to I/O operations.
V8 WorkerThread pool. These threads do background work such as GC and compile optimization, I guess. The default pool size is 4. You can run node with the option --v8-pool-size=N to change it and use top -Hp to view it.
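If it helps to see both knobs in action, something along these lines works (app.js and the sizes are placeholders; note that libuv only creates its pool lazily, on the first piece of I/O or crypto work handed to it):

$ UV_THREADPOOL_SIZE=8 node --v8-pool-size=2 app.js &
$ top -Hp $(pgrep -f app.js)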
v8-platform.h contains some clues about background threads and foreground threads.
If anyone has ideas, you are welcome to improve this answer.
I have deployed a WSGI application on Apache and configured it like this:
WSGIDaemonProcess wsgi-pcapi user= group= processes=2 threads=15
After I restart Apache, I count the number of threads:
ps -efL | grep | grep -c httpd
The local Apache is running only one WSGI app, but the number I get back is 36 and I cannot understand why. I know that there are 2 processes with 15 threads each, which means:
15*2+2=32
So why do I have 4 more?
You mean, why do you have 3 extra threads per mod_wsgi daemon process?
For your configuration, 15 new threads are created in each daemon process for handling requests; counting the 3 extras as well gives 2*(15+3) = 36, which matches what you observed. The other 3 threads in a process are due to:
The main thread, which the process was started as. It will wait until the appropriate signal is received to shut down the process.
A monitor thread, which checks for certain events to occur and which will signal the process to shut down.
A deadlock thread, which checks to see if a deadlock has occurred in the Python interpreter. If it does occur, it will send an event which thread (2) will detect. Thread (2) would then send a signal to the process to quit. That signal would be detected by thread (1), which would then gracefully exit the process and try to clean up properly.
So the extra threads are all about ensuring that the whole system is very robust in the event of various things that can occur. Plus ensuring that when the process is being shut down, the Python sub-interpreters are destroyed properly, to allow atexit-registered Python code to run and do its own cleanup.
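As a side note, a quick way to see the per-process breakdown instead of one big total is the nlwp column (the kernel's per-process thread count); the mod_wsgi daemon processes normally show up under the httpd name as well, unless you've set display-name on the daemon process group:

$ ps -o pid,nlwp,args -C httpd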