I have a long-running Celery task which I can see in Flower at the beginning of its execution, but after some hours it's as if it doesn't exist! All I see on the task details page is Unknown task '894a8b45-5963-40da-a104-7ffff98bc267', and I cannot find that key in Redis, but I can tell the task is still running by looking at the logs. The task does not fail; it just disappears!
Flower keeps task information in memory; it is not related to your Celery result backend (Redis, for example).
By default it keeps the last 10,000 tasks, so if your system runs many tasks it makes sense that your task eventually ages out and is purged to make room for newer ones.
Assuming that's the case, you'll see the Unknown task message.
You can tune the number of tasks kept in memory with the --max-tasks option.
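For example, here is a minimal sketch of raising that limit (and optionally persisting Flower's state across restarts), assuming a recent Flower release that reads a flowerconfig.py file from the directory you launch it in:

# flowerconfig.py -- picked up when you run `celery -A proj flower`
# Keep more task records in memory before old ones are purged.
max_tasks = 100000

# Optionally persist Flower's state to disk so task history survives restarts.
persistent = True
db = 'flower.db'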
I ran into this FAQ indicating that sharing a persistent job store among two or more processes will lead to incorrect scheduler behavior:
How do I share a single job store among one or more worker processes?
My question is: if there's only a single worker scheduler that has been started via .start(), and another scheduler process is initialized against the same persistent SQLite job store only to print the trigger of a certain job_id (it won't invoke .start()), could that lead to incorrect scheduler behavior?
Using apscheduler 3.6.3
Yes. First of all, the scheduler has to be started for it to return the list of persistently stored jobs. Another potential issue is that the current APScheduler version deletes, on retrieval, any jobs for which it cannot find the corresponding task function. This behavior was initially added to clear out obsolete jobs, but it was in retrospect ill-conceived and will be removed in v4.0.
On the upside, it is possible to start the scheduler in paused mode so it won't try to run any jobs but will still give you the list of jobs, so long as all the target functions are importable.
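A minimal sketch of that second, read-only process, assuming APScheduler 3.x with a SQLAlchemy job store pointed at the same SQLite file (the database path and job id are placeholders):

# inspect_job.py -- start a second scheduler in paused mode just to read job metadata
from apscheduler.schedulers.background import BackgroundScheduler
from apscheduler.jobstores.sqlalchemy import SQLAlchemyJobStore

scheduler = BackgroundScheduler(
    jobstores={'default': SQLAlchemyJobStore(url='sqlite:///jobs.sqlite')}
)

# paused=True: the scheduler loads the job store but never fires any jobs
scheduler.start(paused=True)

job = scheduler.get_job('my_job_id')  # placeholder job id
if job is not None:
    print(job.trigger)

scheduler.shutdown()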
I am putting together a Celery-based data ingestion pipeline. One thing I do not see anywhere in the documentation is how to build a flow where workers are only running when there is work to be done. (This seems like a major flaw in the design of Celery, honestly.)
I understand Celery itself won't handle autoscaling of actual servers, that's fine, but when I simulate this, Flower doesn't see the work that was submitted unless the worker was online when the task was submitted. Why? I'd love a world where I'm not paying for servers unless there is actual work to be done.
Workflow:
Imagine a while loop that's adding new data to be processed using the celery_app.send_task method.
I have custom code that sees there are N messages in the queue. It spins up a server and starts a Celery worker for that task.
The Celery worker comes online and does the work.
BUT.
Flower has no record of that task, even though I can see the broker has a "message", and while watching the output of the worker, I can see it did its thing.
If I keep the worker online and then submit a task, it monitors everything just fine and dandy.
Anyone know why?
You can use Celery autoscaling. For example, setting autoscale to 8 means it will fire up to 8 processes to process your queue(s). It will still have a master process sitting and waiting, though. You can also set a minimum, for example 2-8, which will keep 2 worker processes waiting but fire up more (up to 8) if needed (and then scale back down when the queue is empty).
This is the process-based autoscaler. You can use it as a reference if you want to create, for example, a cloud-based autoscaler that fires up new nodes instead of just processes.
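As a rough sketch, assuming your Celery app lives in a module named tasks.py (the module name and broker URL are placeholders), a worker with a 2-8 autoscale range looks like this:

# tasks.py -- a minimal Celery app
from celery import Celery

app = Celery('tasks', broker='redis://localhost:6379/0')

@app.task
def ingest(record):
    ...  # do the actual work here

# Start the worker with autoscaling: at most 8 pool processes, at least 2:
#   celery -A tasks worker --autoscale=8,2

If you do want to build a cloud-flavoured autoscaler, the worker_autoscaler setting lets you point Celery at your own Autoscaler subclass, but that's an internal interface, so check the source of the process-based autoscaler for the exact hooks.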
As to your Flower issue, it's hard to say without knowing your broker (Redis/RabbitMQ/etc.). Flower doesn't capture everything, as it relies on the broker for that, and some configurations cause the broker to delete information such as which tasks have run.
I have a Node.js script that runs once a day on an Ubuntu EC2 instance. This script pulls data from a few hundred thousand remote APIs and saves the results to our local database. Is there any way we can monitor this Node.js script on the remote server? There have been a few instances where the script crashed for some reason and we were unable to figure it out without SSHing into the instance and checking the logs. After the first few crashes I did, however, create a small system which sends us an email whenever the script crashes due to an uncaught exception, and also when it completes execution.
However, we need to develop a better system where we can monitor the progress of the script via the web interface of our admin application (which is deployed on another instance) and also trigger start/stop of the script via this interface. What are the possible options for achieving this?
If you'd like to stay in Node.js, there are several process-monitoring tools:
PM2 comes with lots of other features besides monitoring processes. You can monitor your processes via the CLI or their official web interface: https://keymetrics.io/. A quick search on npm also turns up a bunch of nice unofficial GUI tools: https://www.npmjs.com/search?q=pm2+web
Forever is not as feature-rich as PM2, but it handles the basic process operations, and a couple of GUIs are also available on npm.
There are two problems here that you are trying to solve:
Scheduling work to be done
Monitoring a process for failure
At a simple level, this is easy: schedule a cron job and restart failed things so they keep trying.
However, when things don't go smoothly, it helps to have a lot more granularity over what you are scheduling and how it is executed. This also gives you visibility into each little piece of work.
Adding a little more complexity, you can end up with something like this:
Schedule the script that starts everything (via cron, if that's comfortable)
That script generates the individual jobs that need to be executed and pushes them into a queue
A worker process (or n worker processes) consumes that queue and executes the pending jobs
You can monitor both the progress of the jobs and the state of each worker (number of crashes, failures, jobs completed, etc.). The other tools mentioned above (forever, pm2, etc.) are good candidates for this.
When jobs fail, other workers can pick up the small piece of work that was in progress and restart it. This is much more efficient than restarting the entire process, and also lets you parallelize things across n workers based on how you can split up the workloads.
You could easily throw the status onto a web app so you can check in periodically rather than have to dig through server logs.
You can also get more intelligent with different types of failures. Network error? Retry 5 times. Rate limited? Gradual back-off. Crash? Don't retry, and notify via email. And so on.
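To illustrate that last idea concretely (sketched in Python here purely as an illustration; the same shape applies in Node, and the exception names and limits are placeholders):

import time

# Placeholder exception types standing in for whatever your HTTP client raises.
class NetworkError(Exception): ...
class RateLimited(Exception): ...

def run_with_policy(job, max_retries=5):
    """Retry network errors up to max_retries times, back off gradually on
    rate limits, and let anything else propagate so a notifier can email someone."""
    delay = 1
    for attempt in range(max_retries):
        try:
            return job()
        except NetworkError:
            continue              # transient: just try again
        except RateLimited:
            time.sleep(delay)     # gradual back-off
            delay *= 2
    raise RuntimeError('job failed after retries')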
I have tried this with PM2: you can get the info of the task, then cat out or grab the log files. Or you could have a logging server; see also: https://github.com/papertrail/remote_syslog2
Our Node app is getting quite big and one job takes quite some time to execute. We run this job as a cron job, but by calling a URL. Heroku has problems with this, because the job takes more than 30 seconds to finish. So we receive a timeout, and after that it immediately tries to execute it again, and again, until our memory quota is at about 300% and the app crashes.
Now I want to fix this. Locally we don't have any problems running this script at all. It takes about a minute (for now; in the future, if we have more users, it may take longer) to finish, and memory stays stable.
Running this script in the background should fix the problem, according to https://devcenter.heroku.com/articles/request-timeout#debugging-request-timeouts
Over here, https://devcenter.heroku.com/articles/asynchronous-web-worker-model-using-rabbitmq-in-node#getting-started, I read about jackrabbit. But it seems like it's meant for systems like RabbitMQ: https://github.com/hunterloftis/jackrabbit
So my question: does anyone have experience with background tasks in Node? Can and should I use jackrabbit for my background tasks, or are there better solutions? My background task is just a very complex Express.js task which takes some time to execute, so...
I'm the Node.js platform owner at Heroku (and I actually wrote the web worker article you referenced).
Your use case sounds like it may fit the scheduler very well:
https://devcenter.heroku.com/articles/scheduler
It's a great replacement for cron-type jobs.
The Jobs API in Eclipse RCP apparently works much differently than I expected. I thought that creating and scheduling multiple Jobs would actually cause multiple worker threads to be created, executing the Jobs in parallel unless there was an ISchedulingRule conflict.
I went back and read the documentation more closely, and also discovered this comment in the JobManager class:
/**
* Returns a running or blocked job whose scheduling rule conflicts with the
* scheduling rule of the given waiting job. Returns null if there are no
* conflicting jobs. A job can only run if there are no running jobs and no blocked
* jobs whose scheduling rule conflicts with its rule.
*/
Now it looks to me like the Job manager will only ever attempt to use one background worker thread. Am I completely wrong about this? If I'm right,
what is the point of scheduling rules and locks? If there is only one worker thread, Jobs can never preempt each other. Wouldn't these only ever be used in case a Job's sleep() method is called (e.g. sleeping while holding a Lock)?
does any part of the platform allow two Jobs to actually run concurrently, on multiple worker threads, thus making the above features useful somehow?
What am I missing here?
Take a look at the run method in the documentation, specifically this part:
Jobs can optionally finish their execution asynchronously (in another thread) by returning a result status of ASYNC_FINISH. Jobs that finish asynchronously must specify the execution thread by calling setThread, and must indicate when they are finished by calling the method done.
ASYNC_FINISH there looks interesting.
AFAIK, creating and scheduling multiple Jobs DOES actually cause multiple worker threads to be created and the Jobs to be executed in parallel.
However, if you specify an optional scheduling rule for your job (using the setRule() method) and that rule conflicts with another job's scheduling rule, then those two jobs can't run simultaneously.
This Eclipse Corner article provides a good description as well as a few code samples for the Eclipse Jobs API.
The IJobManager API is only needed for advanced job manipulation, e.g. when you need to use locks, synchronize between several jobs, terminate jobs, etc.
Note: Eclipse 4.5M4 (Q4 2014) will now include support for Job Groups with throttling.
See bug 432049:
Eclipse provides a simple Jobs API to perform different tasks in parallel and in an asynchronous fashion. One limitation of Eclipse Jobs is that there is no easy way to limit the number of worker threads being used to execute jobs.
This may lead to a thread pool explosion when many jobs are scheduled in quick succession. Because of that, it's easy to use Jobs to perform different unrelated tasks in parallel, but hard to implement thousands of Jobs co-operating to complete a single large task.
Eclipse currently supports the concept of Job Families, which provides one way of grouping with support for join, cancel, sleep, and wakeup operations on the whole family.
To address all these issues, we would like to propose a simple way to group a set of Eclipse Jobs that are responsible for pieces of the same large task.
The API would support throttling, join, cancel, and combined progress and error reporting for all of the jobs in the group, and the job grouping functionality could be used to rewrite performance-critical algorithms to use parallel execution of cooperating jobs.
You can see the implementation in this commit 26471fa