I am currently trying out Apache Airflow on my system (Ubuntu 18) and I have set it up with PostgreSQL and RabbitMQ to use the CeleryExecutor.
I run airflow webserver and airflow scheduler in separate consoles, but the scheduler only marks tasks as queued and no worker actually runs them.
I tried opening another terminal and running airflow worker on its own, and that seemed to do the trick.
Now the scheduler puts tasks on a queue and the worker I ran manually actually executes them.
From what I have read, that should not be necessary: the scheduler should start the workers on its own, right? What can I do to make this work?
I have checked the logs from the consoles and I don't see any errors.
This is expected. If you look at the docs for airflow worker, that command exists specifically to bring up a Celery worker when you're using the CeleryExecutor, while the other executors do not need a separate process for tasks to run (a minimal configuration sketch follows the list below).
LocalExecutor: uses multiprocessing to run tasks within the scheduler.
SequentialExecutor: just runs one task at a time so that happens within the scheduler as well.
CeleryExecutor: scales out by having N workers, so having it as a separate command lets you run a worker on as many machines as you'd like.
KubernetesExecutor: I imagine it talks to your Kubernetes cluster to tell it to run tasks.
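To make that concrete, here is a minimal sketch of what a CeleryExecutor setup typically involves; the keys live in airflow.cfg, the connection strings are placeholders, and exact key names vary a bit between Airflow versions:

# airflow.cfg (illustrative values only)
[core]
executor = CeleryExecutor
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@localhost/airflow
[celery]
broker_url = amqp://guest:guest@localhost:5672//
result_backend = db+postgresql://airflow:airflow@localhost/airflow

With that in place you run three separate long-lived processes: airflow webserver, airflow scheduler, and airflow worker (one worker per machine that should execute tasks; in Airflow 2.x the command is airflow celery worker).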
Related
I have a system that has important long-running tasks which are executed by Celery workers. Assume that we have deployed our application using k8s or docker-compose.
How can I change the celery workers' code in production without losing the tasks that they are currently executing?
In other words, I want an elegant, automated way to execute all unfinished tasks with the new workers.
I’m using Redis 4.3.3 as the broker and my Celery version is 5.2.7.
I have added result_backend and tried the following settings, but Celery didn't reschedule the running tasks after I ran "docker-compose restart worker_service_name".
CELERY_ACKS_LATE = True
CELERY_TASK_REJECT_ON_WORKER_LOST = True
This answer should provide some information about running on Kubernetes.
In addition, I would recommend adding the setting below (doc); a combined configuration sketch follows the quoted documentation:
CELERYD_PREFETCH_MULTIPLIER = 1
How many messages to prefetch at a time multiplied by the number of concurrent processes. The default is 4 (four messages for each process). The default setting is usually a good choice, however – if you have very long running tasks waiting in the queue and you have to start the workers, note that the first worker to start will receive four times the number of messages initially. Thus the tasks may not be fairly distributed to the workers.

To disable prefetching, set worker_prefetch_multiplier to 1. Changing that setting to 0 will allow the worker to keep consuming as many messages as it wants.
So we have a Kubernetes cluster running some pods with Celery workers. We are using Python 3.6 to run those workers and the Celery version is 3.1.2 (I know, really old; we are working on upgrading it). We have also set up an autoscaling mechanism to add more Celery workers on the fly.
The problem is the following. Let's say we have 5 workers at any given time. Then a lot of tasks come in, increasing the CPU/RAM usage of the pods. That triggers an autoscaling event, adding, let's say, two more Celery worker pods. Those two new workers pick up some long-running tasks. Before they finish running those tasks, Kubernetes triggers a downscaling event, killing those two workers and killing those long-running tasks too.
Also, for legacy reasons, we do not have a retry mechanism if a task is not completed (and we cannot implement one right now).
So my question is: is there a way to tell Kubernetes to wait for the Celery worker to finish all of its pending tasks? I suppose the solution must also include some way to notify the Celery worker to stop receiving new tasks. Right now I know that Kubernetes lets you run scripts to handle this kind of situation, but I do not know what to write in those scripts, because I do not know how to make the Celery worker stop receiving tasks.
Any idea?
I wrote a blog post exactly on that topic - check it out.
When Kubernetes decides to kill a pod, it first sends a SIGTERM signal so your application has time to shut down gracefully; after that, if your application still hasn't exited, Kubernetes kills it by sending a SIGKILL signal.
This period between SIGTERM and SIGKILL can be tuned with terminationGracePeriodSeconds (more about it here).
In other words, if your longest task takes 5 minutes, make sure to set this value to something higher than 300 seconds.
Celery handles those signals for you, as you can see here (I assume it is relevant for your version as well):
Shutdown should be accomplished using the TERM signal.
When shutdown is initiated the worker will finish all currently executing tasks before it actually terminates. If these tasks are important, you should wait for it to finish before doing anything drastic, like sending the KILL signal.
As explained in the docs, you can set the acks_late=True configuration so that a task runs again if it was stopped accidentally.
Another thing that I didn't find documentation for (though I'm almost sure I saw it somewhere): a Celery worker won't accept new tasks after receiving a SIGTERM, so it should be safe to terminate the worker (you might need to set worker_prefetch_multiplier = 1 as well).
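If you prefer to do this per task rather than globally, the same options can be passed to the task decorator; a minimal sketch (process_upload is just an illustrative task name):

from celery import Celery

app = Celery("tasks", broker="redis://localhost:6379/0")  # placeholder broker URL

@app.task(bind=True, acks_late=True, reject_on_worker_lost=True)
def process_upload(self, upload_id):
    # Long-running work goes here. Because the message is only acknowledged
    # after the task finishes, a task interrupted by a pod being killed is
    # redelivered and runs again elsewhere.
    ...

The usual caveat applies: with acks_late a task can execute more than once, so it needs to be idempotent.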
I have a Spark job that periodically hangs, leaving my AWS EMR cluster in a state where an application is RUNNING but the cluster is actually stuck. I know that if my job doesn't get stuck, it'll finish in 5 hours or less. If it's still running after that, it's a sign that the job is stuck. YARN and the Spark UI are still responsive; it's just that an executor gets stuck on a task.
Background: I'm using an ephemeral EMR cluster that performs only one step before terminating, so it's not a problem to kill it off if I notice this job is hanging.
What's the easiest way to kill the task, job, or cluster in this case? Ideally this would not involve setting up some extra service to monitor the job -- ideally there would be some kind of Spark / YARN / EMR setting I could use.
Note: I've tried using Spark speculation to unblock the stuck Spark job, but that doesn't help.
EMR has a Bootstrap Actions feature that lets you run scripts when the cluster is being initialized. I've used this feature along with a startup script that monitors how long the cluster has been online and terminates the cluster after a certain time.
I use a script based on this one for the bootstrap action: https://github.com/thomhopmans/themarketingtechnologist/blob/master/6_deploy_spark_cluster_on_aws/files/terminate_idle_cluster.sh
Basically, make a script that checks /proc/uptime to see how long the EC2 machine has been online; once uptime surpasses your time limit, send a shutdown command to the cluster.
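A minimal Python sketch of that idea (the 6-hour limit is an assumption based on the 5-hour expected runtime, and the loop would be launched in the background by the bootstrap action):

import subprocess
import time

MAX_UPTIME_SECONDS = 6 * 60 * 60  # assumed limit, comfortably above the normal 5-hour run

def uptime_seconds():
    # The first field of /proc/uptime is seconds since the instance booted.
    with open("/proc/uptime") as f:
        return float(f.read().split()[0])

while True:
    if uptime_seconds() > MAX_UPTIME_SECONDS:
        # Shutting down the master node ends the cluster, which is roughly
        # what the linked shell script does once its limit is reached.
        subprocess.run(["sudo", "shutdown", "-h", "now"], check=False)
        break
    time.sleep(60)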
Trying to figure out a way to start / run an external process in all workers before starting the jobs / tasks.
Specific use case - my job hits a service running on the node (localhost). The service itself is run via a docker container. I want to start the docker container before starting the tasks on a worker and then stop the container after all the jobs are done.
One approach could be to use rdd.mapPartitions (sketched below), but that operates at the executor level, and I cannot cleanly stop the container because another partition might be executing on the same node. Any suggestions?
As a workaround, I currently start the Docker containers while bringing up the cluster itself, but that does not allow me to work with multiple different containers that may be required for different jobs (in that case all containers would be running all the time, taking up node resources).
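For reference, the mapPartitions idea mentioned above would look roughly like this (start_container, stop_container and process are hypothetical helpers wrapping docker start/stop and the per-record work); it also shows why the cleanup isn't clean, since the stop runs per partition while other partitions on the same node may still need the container:

def run_with_local_service(partition):
    # Hypothetical helper: bring up the Docker container on this node.
    start_container()
    try:
        for record in partition:
            yield process(record)  # hypothetical per-record call against the local service
    finally:
        # Problem: another partition scheduled on this node may still be
        # using (or about to use) the container we are stopping here.
        stop_container()

results = rdd.mapPartitions(run_with_local_service).collect()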
I have noticed that some of my Sidekiq workers appear to be running multiple processes (working multiple jobs concurrently) in a single dyno (the logs would suggest this).
How many processes could be, or are, running separate jobs concurrently within a single dyno without using swarming (the Enterprise feature)?
I have everything set up with the defaults and without using swarms, so each Sidekiq worker is using 25 threads. What exactly all these threads are used for, however, I have no idea. Can anyone help me understand how this translates into concurrent workers working jobs inside a single Heroku dyno?
You are seeing a single Sidekiq process with 25 threads running jobs concurrently. Each thread will execute a job so you can have up to 25 jobs running at once.
Without swarm, you can only run one process per dyno.
You can run multiple processes in a dyno using swarm, but how many depends on the memory requirements of your app and how many cores the dyno has.
For example, the following gives you 100 worker threads (4 processes × 25 threads each):
SIDEKIQ_COUNT=4 bundle exec sidekiqswarm -e production -c 25