I have a script which, when run, adds RSS feed parsing tasks to some Celery queues. I have now set up APScheduler to run the script every 2 hours to get new data from the feeds.
My implementation looks like this:
#!/usr/bin/env python
import atexit
import logging
import os
from logging import getLogger
from apscheduler.schedulers.blocking import BlockingScheduler

logger = getLogger('scheduled_parser')
PARSER_SCHEDULER = 'parser_scheduler'

def main():
    scheduler = BlockingScheduler(job_defaults={'coalesce': True})
    scheduler.add_jobstore('sqlalchemy', alias='scheduler_config', url=os.environ.get("DATABASE_URL"))
    scheduler.add_job(run_parser, 'interval', seconds=int(os.environ.get("SCHEDULER_RUN_FREQUENCY")),
                      id=PARSER_SCHEDULER, replace_existing=True)
    scheduler.start()
    atexit.register(lambda: scheduler.shutdown())

def run_parser():
    < code to add items to queues>

if __name__ == "__main__":
    logging.basicConfig()
    logger.setLevel(logging.INFO)
    main()
My code is deployed on Heroku and I have the following in my Procfile:
clock: python scheduled_parser
<celery worker processes>
I am having the following issues:
1. I am storing the scheduler job in persistent storage and I can even see it in my DB, but when I call scheduler.get_job(PARSER_SCHEDULER, 'scheduler_config') I get None.
2. Whenever I deploy on Heroku, the next run time seems to be reset. For example, if the parser is set to run every 2 hours and the next run is due at 4:00pm, and I deploy on Heroku at 3:00pm, then the next run happens at 5:00pm instead of 4:00pm.
Not sure about your issue #1, but I think issue #2 is that on every deploy, this line is going to replace the job, thus resetting the schedule:
scheduler.add_job(run_parser, 'interval', seconds=int(os.environ.get("SCHEDULER_RUN_FREQUENCY")),
                  id=PARSER_SCHEDULER, replace_existing=True)
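To illustrate why replacing the job shifts the schedule: an interval job created without a start date counts from the moment it is added, so each deploy restarts the clock. If the interval is instead computed from a fixed anchor time, the next run stays the same no matter when the process restarts. APScheduler's interval trigger accepts a start_date argument that can serve as such an anchor. The sketch below is plain Python only, not APScheduler code; next_run_time and its parameters are hypothetical names used for illustration.

```python
from datetime import datetime, timedelta


def next_run_time(anchor: datetime, interval: timedelta, now: datetime) -> datetime:
    """Return the first run time strictly after `now` on the grid anchor + k * interval."""
    if now < anchor:
        return anchor
    elapsed = now - anchor
    steps = elapsed // interval + 1  # timedelta // timedelta yields an int
    return anchor + steps * interval


# Hypothetical values matching the example in the question
anchor = datetime(2020, 1, 1, 12, 0)        # fixed start of the schedule
every_two_hours = timedelta(hours=2)
deploy_time = datetime(2020, 1, 1, 15, 0)   # redeploy at 3:00pm

# Anchored schedule: next run stays on the original grid
anchored = next_run_time(anchor, every_two_hours, deploy_time)

# Schedule restarted at deploy time: next run is deploy time + interval
reset = deploy_time + every_two_hours
```

Here anchored works out to 4:00pm and reset to 5:00pm, which matches the behaviour described in issue #2.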
I've got a simple FastAPI webapp going and I'd like to be able to check the database connection on startup (and retry connection if it fails)
I've got the following code, but it doesn't feel right
# main.py
import uvicorn
from backend.app import app
if __name__ == "__main__":
    uvicorn.run(app, port=8001)
# app.py
# ... omitted for brevity
from backend.database import notes, tags
# ... omitted for brevity
# database.py
from motor.motor_asyncio import AsyncIOMotorClient
from asyncio import get_event_loop
client = AsyncIOMotorClient("localhost", 27027)
loop = get_event_loop()
data = loop.run_until_complete(client.server_info())
db = client.notes_db
notes = db.notes
tags = db.tags
Without get_event_loop() and the subsequent loop.run_until_complete() call it won't test the database connection until you actually try to access / write to it.
My goal is to be able to halt the startup process until it can successfully connect to a database, is there any clean way to do this with Python and motor.io (https://motor.readthedocs.io/, sorry there's no tag for it) ?
The startup event in FastAPI is what you want here, I think. In addition, this repository is a nice example, and this thread may give you more information. You can run your connection check inside the startup event handler; the application won't start serving requests until the handler has completed successfully.
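A minimal sketch of the retry part, assuming it is awaited from a startup handler. wait_for_database, its parameters, and the fake flaky_ping are hypothetical names for illustration; in the question's setup the check passed in would be client.server_info from motor.

```python
import asyncio
from typing import Awaitable, Callable


async def wait_for_database(
    ping: Callable[[], Awaitable[object]],
    retries: int = 5,
    delay: float = 1.0,
) -> None:
    """Await `ping` until it succeeds, sleeping `delay` seconds between attempts.

    Re-raises the last error once `retries` attempts have failed.
    """
    for attempt in range(1, retries + 1):
        try:
            await ping()
            return
        except Exception:
            if attempt == retries:
                raise
            await asyncio.sleep(delay)


# Stand-in for a real connection check: fails twice, then succeeds
attempts = {"count": 0}


async def flaky_ping() -> None:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("database not ready")


asyncio.run(wait_for_database(flaky_ping, retries=5, delay=0.01))
```

In a FastAPI app the call would sit inside a function registered as a startup handler, something like await wait_for_database(client.server_info). If every attempt fails, the exception propagates and startup is aborted.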
I am unable to get rq_scheduler working. Here is a simple example:
app.py
from flask import Flask
import datetime
from redis import Redis
from rq import Queue
from rq_scheduler import Scheduler
from tasks import example
app=Flask(__name__)
app.secret_key='abc'
app.redis = Redis.from_url('redis://')
app.task_queue = Queue('test', connection=app.redis)
scheduler = Scheduler(queue=app.task_queue,connection=app.redis)
#app.task_queue.enqueue('tasks.example',2)
#scheduler.enqueue_at(datetime.datetime(2020,4,16,10,46), example, 2)
scheduler.enqueue_in(datetime.timedelta(seconds=1), example, 2)
if __name__=='__main__':
    app.run(host='0.0.0.0', port=5000, debug=True)
tasks.py
import time
def example(seconds):
    print('Starting task')
    for i in range(seconds):
        print(i)
        time.sleep(1)
    print('Task completed')
In the app directory in terminal, I start the following in separate tabs:
$redis-server
$rq worker test
$rqscheduler
$python app.py
The first queue.enqueue works fine. Both scheduler tasks do nothing. What is wrong?
I suspect that you may be getting confused because rqscheduler by default checks for new jobs only once a minute. You can tweak this with the -i flag to set the interval in seconds, and also add the -v flag for more verbose output:
rqscheduler -i 1 -v
However I also noticed another issue with the above Flask code...
Probably due to the dev server spawning a separate process I was finding that the scheduler.enqueue_in function was enqueuing the job twice. This probably wouldn't be an issue if the enqueue_in function was called inside a view function. However where you have it placed this actually runs when the application is started.
So when launching with the dev server this gets executed twice. It will then run once more every time the autoreloader senses a code change. So after starting the dev server and then saving a change to the code, 3 jobs in total have been enqueued.
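If you do want to keep start-up code like this in the module, one possible guard is to check the WERKZEUG_RUN_MAIN environment variable, which the Werkzeug reloader sets to "true" in the child process that actually serves requests. This is only a sketch: should_run_startup_work is a hypothetical helper, and it does not stop the code from running again after each reload.

```python
import os
from typing import Mapping


def should_run_startup_work(
    use_reloader: bool,
    environ: Mapping[str, str] = os.environ,
) -> bool:
    """Return True in the one process that should perform one-time startup work."""
    if not use_reloader:
        # No reloader: there is only one process, so it does the work
        return True
    # With the reloader, only the child process has this variable set to "true"
    return environ.get("WERKZEUG_RUN_MAIN") == "true"


# Parent (watcher) process under the reloader: skip the work
parent_runs = should_run_startup_work(True, {})

# Reloader child process: do the work
child_runs = should_run_startup_work(True, {"WERKZEUG_RUN_MAIN": "true"})
```

The scheduler.enqueue_in call would then be wrapped in an if should_run_startup_work(...) check.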
For the purpose of testing this, it may be advisable just to have a simple python script which doesn't actually run the Flask app:
# enqueue_test.py
import datetime
from redis import Redis
from rq import Queue
from rq_scheduler import Scheduler
from tasks import example
r = Redis.from_url('redis://localhost:6379')
q = Queue('test', connection=r)
scheduler = Scheduler(queue=q, connection=r)
scheduler.enqueue_in(datetime.timedelta(seconds=1), example, 2)
I have a task queue processing service that I'm trying to run pytest function testing on. When running it in 'production', I start this from the command line, e.g. python main.py.
I can't figure out how to start this task service from pytest to do function testing on it. How do I start up the service inside pytest so that I can then add a job to it and see if the job gets processed and added to the database when completed?
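import multiprocessing

# Running worker processes, keyed by store name
task_processing = {}

def main():
    store = "jobs"
    worker_id = 1
    # Process tasks (process_tasks is defined elsewhere in the service)
    task_processing[store] = multiprocessing.Process(
        target=process_tasks, args=(store, worker_id)
    )
    task_processing[store].start()

if __name__ == "__main__":
    main()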
Just make sure you access the main function correctly:
from main import main
def test_main():
    main()
    ...
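Since main() only starts the worker process and returns straight away, the test then has to wait for the job to be processed before asserting on the result. A small polling helper is one way to do that. This is a sketch: wait_until is a hypothetical helper, and the thread below merely stands in for the real worker so the example runs on its own.

```python
import threading
import time
from typing import Callable


def wait_until(
    predicate: Callable[[], bool],
    timeout: float = 5.0,
    interval: float = 0.05,
) -> bool:
    """Poll `predicate` until it returns True or `timeout` seconds have passed."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return predicate()


# Stand-in for the task service: finishes a "job" after a short delay
results = []


def fake_worker() -> None:
    time.sleep(0.1)
    results.append("done")


worker = threading.Thread(target=fake_worker)
worker.start()

# In a real test the predicate would query the database for the finished job
finished = wait_until(lambda: len(results) == 1, timeout=2.0)
worker.join()
```

In the real test you would also want to terminate the worker process at the end (for example in a pytest fixture's teardown) so it does not outlive the test run.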
I am working on a Python program using Flask in which I want to extract keys from a dictionary. The keys are in text format. I want to repeat this whole process at a fixed time interval and display the output in a local browser each time.
I have tried this using flask_apscheduler. The program runs and shows the output, but only once; it does not repeat after the interval.
This is the Python program I tried:
@app.route('/trend', methods=['POST', 'GET'])
def run_tasks():
    for i in range(0, 1):
        app.apscheduler.add_job(func=getTrendingEntities, trigger='cron', args=[i], id='j'+str(i), second = 5)
    return "Code run perfect"
@app.route('/loc', methods=['POST', 'GET'])
def getIntentAndSummary(self, request):
    if request.method == "POST":
        reqStr = request.data.decode("utf-8", "strict")
        reqStrArr = reqStr.split()
        reqStr = ' '.join(reqStrArr)
        text_1 = []
        requestBody = json.loads(reqStr)
        if requestBody.get('m') is not None:
            text_1.append(requestBody.get('m'))
        return jsonify(text_1)
if (__name__ == "__main__"):
    app.run(port = 8000)
The problem is that you're calling add_job every time the /trend page is requested. The job should only be added once, as part of the initialization, before starting the scheduler (see below).
It would also make more sense to use the 'interval' trigger instead of 'cron', since you want your job to run every 5 seconds. Here's a simple working example:
from flask import Flask
from flask_apscheduler import APScheduler
import datetime
app = Flask(__name__)
#function executed by scheduled job
def my_job(text):
    print(text, str(datetime.datetime.now()))
if (__name__ == "__main__"):
    scheduler = APScheduler()
    scheduler.add_job(func=my_job, args=['job run'], trigger='interval', id='job', seconds=5)
    scheduler.start()
    app.run(port = 8000)
Sample console output:
job run 2019-03-30 12:49:55.339020
job run 2019-03-30 12:50:00.339467
job run 2019-03-30 12:50:05.343154
job run 2019-03-30 12:50:10.343579
You can then modify the job attributes by calling scheduler.modify_job().
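To make the trigger difference concrete: a cron trigger with second=5 fires whenever the seconds field of the clock equals 5, which is once per minute, while an interval trigger with seconds=5 fires every 5 seconds. The plain-Python sketch below just enumerates both sets of fire times over a two-minute window; the function names are hypothetical and nothing here is APScheduler code.

```python
from datetime import datetime, timedelta
from typing import List


def cron_second_fire_times(start: datetime, end: datetime, second: int) -> List[datetime]:
    """Whole-second times in [start, end) whose seconds field equals `second`."""
    times = []
    t = start
    while t < end:
        if t.second == second:
            times.append(t)
        t += timedelta(seconds=1)
    return times


def interval_fire_times(start: datetime, end: datetime, seconds: int) -> List[datetime]:
    """Times in (start, end) spaced `seconds` apart, counted from `start`."""
    times = []
    step = timedelta(seconds=seconds)
    t = start + step
    while t < end:
        times.append(t)
        t += step
    return times


start = datetime(2019, 3, 30, 12, 49, 0)
end = start + timedelta(minutes=2)

cron_times = cron_second_fire_times(start, end, 5)     # 12:49:05 and 12:50:05
interval_times = interval_fire_times(start, end, 5)    # 12:49:05, 12:49:10, ...
```

Over the two minutes the cron-style rule matches twice and the interval rule 23 times.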
As for the second problem, refreshing the client view every time the job runs, you can't do that directly from Flask. An ugly but simple way would be to add <meta http-equiv="refresh" content="1" > to the HTML page to instruct the browser to refresh it every second. A much better implementation would be to use SocketIO to send new data in real time to the web client.
I would recommend that you start a daemonized thread and import your application variable; you can then use with app.app_context() inside the thread in order to log to your console.
It's a little more fiddly, but it lets the application's work run in separate threads.
I use this method to fire off a bunch of HTTP requests concurrently. The alternative is to wait for each response before making the next request.
I'm sure you've realised that the thread will stay occupied if you run a command that never finishes.
Make sure to daemonize the thread so that when you stop your web app the thread is killed at the same time.
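A minimal sketch of the daemon-thread idea using only the standard library. The names background_loop, stop_event and ticks are hypothetical, and the Flask-specific part (wrapping the work in with app.app_context()) is left out so the example runs on its own.

```python
import threading
import time

stop_event = threading.Event()
ticks = []


def background_loop(interval: float) -> None:
    """Do one unit of work every `interval` seconds until asked to stop."""
    while not stop_event.is_set():
        ticks.append(time.monotonic())  # placeholder for the real periodic work
        stop_event.wait(interval)       # sleeps, but wakes early if stop is requested


# daemon=True means the thread will not keep the process alive on exit
worker = threading.Thread(target=background_loop, args=(0.01,), daemon=True)
worker.start()

time.sleep(0.1)          # the main thread stays free, e.g. to serve requests
stop_event.set()         # optional clean stop before shutdown
worker.join(timeout=1.0)
```

Note that daemon threads are stopped abruptly when the interpreter exits, so if the work needs to finish cleanly it is safer to signal the thread with something like the Event above instead of relying on the daemon flag alone.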
I am creating a compute cluster in python using dispy. One of my use cases would be very nicely solved by starting a process on a compute node that itself starts a distributed process. As such, I have implemented the SharedJobCluster on the primary scheduler, and also in the function that will be sent to the cluster (which should in turn, start a series of distributed processes). However, when the second SharedJobCluster is initiated, the code hangs and does not move past this line (nor show any errors).
Minimum working example:
def clusterfun():
    import dispy
    import test2
    import logging
    log_filename = 'worker.log'
    logging.basicConfig(filename=log_filename,
                        level=logging.DEBUG,
                        format='%(asctime)s %(name)-12s %(levelname)-8s %(message)s',
                        datefmt='[%m-%d-%Y %H:%M:%S]')
    logging.info("Starting cluster...")
    # THE FOLLOWING LINE HANGS
    cluster = dispy.SharedJobCluster(test2.clusterfun2, port=0, scheduler_node='127.0.0.1')
    logging.info("Started cluster...")
    job = cluster.submit()
    logging.info("Submitted job...")
    return job()
if __name__ == '__main__':
    import dispy
    #
    # Start the Compute cluster
    #
    cluster = dispy.SharedJobCluster(clusterfun, port=0, depends=['test2.py'], scheduler_node='127.0.0.1')
    job = cluster.submit()
    print(job())
test2.py contains:
def clusterfun2():
    return "Foo"
For reference, I am currently running dispyscheduler.py, dispynode, and this Python code all on the same machine. This setup works, except when trying to start the embedded distributed task.
The worker.log output contains "Starting cluster..." but nothing else.
If I check the status of the node it says that it is running 1 job, but it never completes.