Pytest function testing multi-processing task queue service - python-3.x

I have a task-queue processing service that I'm trying to write pytest function tests for. In 'production' I start it from the command line, e.g. python main.py.
I can't figure out how to start this service from pytest so that I can function-test it. How do I start the service inside pytest so that I can add a job to it and verify that the job gets processed and written to the database when it completes?
import multiprocessing

# process_tasks (the worker function) is defined elsewhere in the project
task_processing = {}


def main():
    store = "jobs"
    worker_id = 1
    # Process tasks in a separate worker process
    task_processing[store] = multiprocessing.Process(
        target=process_tasks, args=(store, worker_id)
    )
    task_processing[store].start()


if __name__ == "__main__":
    main()

Just make sure you access the main function correctly:
from main import main


def test_main():
    main()
    ...
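If main() just starts the worker process and returns, one way to turn this into a functional test is to wrap it in a fixture that also tears the worker down afterwards, then enqueue a job and poll the database. The following is a rough sketch, not a drop-in solution: it assumes main.py keeps the started Process in a module-level task_processing dict (as in the snippet above), and add_job / job_completed_in_db are hypothetical helpers standing in for however your project enqueues work and checks results.

# test_service.py -- a sketch, assuming main() returns after starting the worker
import time

import pytest

import main as service                                     # the module shown in the question
from main import main
from myproject.jobs import add_job, job_completed_in_db    # hypothetical helpers


@pytest.fixture
def task_service():
    main()                                   # starts the worker process and returns
    yield
    worker = service.task_processing["jobs"]
    worker.terminate()                       # stop the worker once the test is done
    worker.join(timeout=5)


def test_job_is_processed(task_service):
    job_id = add_job({"task": "example"})    # hypothetical: enqueue a job for the service
    deadline = time.time() + 30
    while time.time() < deadline:
        if job_completed_in_db(job_id):      # hypothetical: query the database
            return
        time.sleep(1)
    pytest.fail(f"job {job_id} was not processed within 30 seconds")

The key points are that the worker runs in its own process so pytest is not blocked, and that the test polls for the side effect (the database record) instead of asserting immediately.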

Related

Python RQ-Scheduler not giving any output

I am unable to get rq_scheduler working. Here is a simple example:
app.py
from flask import Flask
import datetime
from redis import Redis
from rq import Queue
from rq_scheduler import Scheduler
from tasks import example
app=Flask(__name__)
app.secret_key='abc'
app.redis = Redis.from_url('redis://')
app.task_queue = Queue('test', connection=app.redis)
scheduler = Scheduler(queue=app.task_queue,connection=app.redis)
#app.task_queue.enqueue('tasks.example',2)
#scheduler.enqueue_at(datetime.datetime(2020,4,16,10,46), example, 2)
scheduler.enqueue_in(datetime.timedelta(seconds=1), example, 2)
if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000, debug=True)
tasks.py
import time


def example(seconds):
    print('Starting task')
    for i in range(seconds):
        print(i)
        time.sleep(1)
    print('Task completed')
In the app directory, I start the following in separate terminal tabs:
$ redis-server
$ rq worker test
$ rqscheduler
$ python app.py
The plain queue.enqueue call works fine. Both scheduler calls do nothing. What is wrong?
I suspect that you may be getting confused because rqscheduler by default checks for new jobs once a minute. You can tweak this with the -i flag to set the interval in seconds, and add the -v flag for more verbose output:
rqscheduler -i 1 -v
However, I also noticed another issue with the above Flask code.
Probably because the dev server spawns a separate process, I found that the scheduler.enqueue_in call was enqueuing the job twice. This wouldn't be an issue if enqueue_in were called inside a view function; however, where you have placed it, it actually runs when the application is started.
So when launching with the dev server it gets executed twice, and it runs again every time the autoreloader detects a code change: after starting the dev server and then saving one change to the code, three jobs in total have been enqueued.
For the purpose of testing this, it may be advisable to use a simple Python script which doesn't actually run the Flask app:
# enqueue_test.py
import datetime

from redis import Redis
from rq import Queue
from rq_scheduler import Scheduler

from tasks import example

r = Redis.from_url('redis://localhost:6379')
q = Queue('test', connection=r)
scheduler = Scheduler(queue=q, connection=r)
scheduler.enqueue_in(datetime.timedelta(seconds=1), example, 2)

Creating detached processes from celery worker/alternative solution?

I'm developing a web service that will be used as a "database as a service" provider. The goal is to have a small Flask-based web service running on some host, and "worker" processes running on different hosts owned by different teams. Whenever a team member comes and requests a new database, I should create one on their host. Now the problem: the process I start must keep running, but the worker might be restarted, whether after 5 minutes or after 5 days. A simple Popen won't do the trick, because it creates a child process, and if the worker stops later on, the Popen'd process is destroyed (I tried this).
I have an implementation using multiprocessing which works like a champ; sadly I cannot use multiprocessing with Celery, so I'm out of luck there. I tried to get away from the multiprocessing library with double forking and named pipes. The most minimal sample I could produce:
import os
import select
import struct
import subprocess
import sys

from celery import shared_task


def launcher2(working_directory, cmd, *args):
    command = [cmd]
    command.extend(list(args))
    process = subprocess.Popen(command, cwd=working_directory, shell=False, start_new_session=True,
                               stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    with open(f'{working_directory}/ipc.fifo', 'wb') as wpid:
        wpid.write(struct.pack('I', process.pid))


@shared_task(bind=True, name="Test")
def run(self, cmd, *args):
    working_directory = '/var/tmp/workdir'
    if not os.path.exists(working_directory):
        os.makedirs(working_directory, mode=0o700)

    ipc = f'{working_directory}/ipc.fifo'
    if os.path.exists(ipc):
        os.remove(ipc)
    os.mkfifo(ipc)

    pid1 = os.fork()
    if pid1 == 0:
        os.setsid()
        os.umask(0)
        pid2 = os.fork()
        if pid2 > 0:
            sys.exit(0)
        os.setsid()
        os.umask(0)
        launcher2(working_directory, cmd, *args)
    else:
        with os.fdopen(os.open(ipc, flags=os.O_NONBLOCK | os.O_RDONLY), 'rb') as ripc:
            readers, _, _ = select.select([ripc], [], [], 15)
            if not readers:
                raise TimeoutError(60, 'Timed out', ipc)
            reader = readers.pop()
            pid = struct.unpack('I', reader.read())[0]
        pid, status = os.waitpid(pid, 0)
        print(status)


if __name__ == '__main__':
    async_result = run.apply_async(('/usr/bin/sleep', '15'), queue='q2')
    print(async_result.get())
My use case is more complex, but I don't think anyone wants to read 200+ lines of bootstrapping; this minimal sample fails in exactly the same way. On the other hand, I don't wait for the pid unless that's required, so the idea is to start the process on request and let it do its job. Bootstrapping a database takes roughly a minute with the full setup, and I don't want clients standing by for a minute. A request comes in, I spawn the process and send back an id for the database instance, and the client can then query the status based on the received instance id. However, with the above forking solution I get:
[2020-01-20 18:03:17,760: INFO/MainProcess] Received task: Test[dbebc31c-7929-4b75-ae28-62d3f9810fd9]
[2020-01-20 18:03:20,859: ERROR/MainProcess] Process 'ForkPoolWorker-2' pid:16634 exited with 'signal 15 (SIGTERM)'
[2020-01-20 18:03:20,877: ERROR/MainProcess] Task handler raised error: WorkerLostError('Worker exited prematurely: signal 15 (SIGTERM).')
Traceback (most recent call last):
  File "/home/pupsz/PycharmProjects/provider/venv37/lib/python3.7/site-packages/billiard/pool.py", line 1267, in mark_as_worker_lost
    human_status(exitcode)),
billiard.exceptions.WorkerLostError: Worker exited prematurely: signal 15 (SIGTERM).
Which leaves me wondering what might be going on. I tried an even simpler task:
import os
import subprocess

from celery import shared_task


@shared_task(bind=True, name="Test")
def run(self, cmd, *args):
    working_directory = '/var/tmp/workdir'
    if not os.path.exists(working_directory):
        os.makedirs(working_directory, mode=0o700)
    command = [cmd]
    command.extend(list(args))
    process = subprocess.Popen(command, cwd=working_directory, shell=False, start_new_session=True,
                               stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return process.wait()


if __name__ == '__main__':
    async_result = run.apply_async(('/usr/bin/sleep', '15'), queue='q2')
    print(async_result.get())
This again fails with the very same error. Now, I like Celery, but from this it feels like it isn't suited to my needs. Did I mess something up? Can what I need be achieved from a worker? Do I have any alternatives, or should I just write my own task queue?
Celery is not multiprocessing-friendly, so try to use billiard instead of multiprocessing (from billiard import Process, etc.). I hope the Celery developers one day do a heavy refactoring of that code, remove billiard, and start using multiprocessing instead.
Until they move to multiprocessing, we are stuck with billiard. My advice is to remove any usage of multiprocessing in your Celery tasks and start using billiard.context.Process and similar, depending on your use case.
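As a rough illustration of that swap (not a drop-in fix for the code above; bootstrap_database is a hypothetical placeholder for the real provisioning work), the task could spawn the long-running work with billiard and return an instance id straight away:

from billiard import Process            # billiard mirrors the multiprocessing API
from celery import shared_task


def bootstrap_database(instance_id):
    ...                                  # hypothetical: create the database, record status


@shared_task(bind=True, name="Test")
def run(self, instance_id):
    # Spawn the work with billiard's Process instead of multiprocessing/raw fork
    proc = Process(target=bootstrap_database, args=(instance_id,))
    proc.start()
    return instance_id                   # respond immediately; the client polls for status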

Is there a way to run a Python Flask function at a specific time interval and display the output on the local server?

I am working on a Python program using Flask in which I want to extract keys from a dictionary (the keys are in text format). I want to repeat this whole process at a specific time interval and display the output in the local browser each time.
I have tried this using flask_apscheduler. The program runs and shows output, but only once; it does not repeat after the interval.
This is the Python program I tried:
import json

from flask import jsonify, request

# app creation and flask_apscheduler setup are omitted above


@app.route('/trend', methods=['POST', 'GET'])
def run_tasks():
    for i in range(0, 1):
        app.apscheduler.add_job(func=getTrendingEntities, trigger='cron', args=[i], id='j' + str(i), second=5)
    return "Code run perfect"


@app.route('/loc', methods=['POST', 'GET'])
def getIntentAndSummary():
    if request.method == "POST":
        reqStr = request.data.decode("utf-8", "strict")
        reqStrArr = reqStr.split()
        reqStr = ' '.join(reqStrArr)
        text_1 = []
        requestBody = json.loads(reqStr)
        if requestBody.get('m') is not None:
            text_1.append(requestBody.get('m'))
        return jsonify(text_1)


if (__name__ == "__main__"):
    app.run(port=8000)
The problem is that you're calling add_job every time the /trend page is requested. The job should only be added once, as part of the initialization, before starting the scheduler (see below).
It would also make more sense to use the 'interval' trigger instead of 'cron', since you want your job to run every 5 seconds. Here's a simple working example:
from flask import Flask
from flask_apscheduler import APScheduler
import datetime

app = Flask(__name__)


# function executed by scheduled job
def my_job(text):
    print(text, str(datetime.datetime.now()))


if (__name__ == "__main__"):
    scheduler = APScheduler()
    scheduler.add_job(func=my_job, args=['job run'], trigger='interval', id='job', seconds=5)
    scheduler.start()
    app.run(port=8000)
Sample console output:
job run 2019-03-30 12:49:55.339020
job run 2019-03-30 12:50:00.339467
job run 2019-03-30 12:50:05.343154
job run 2019-03-30 12:50:10.343579
You can then modify the job attributes by calling scheduler.modify_job().
As for the second problem which is refreshing the client view every time the job runs, you can't do that directly from Flask. An ugly but simple way would be to add <meta http-equiv="refresh" content="1" > to the HTML page to instruct the browser to refresh it every second. A much better implementation would be to use SocketIO to send new data in real-time to the web client.
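If you go the SocketIO route, a rough sketch built on the example above (assuming Flask-SocketIO is installed; the 'job_update' event name is an arbitrary choice, and the page needs the Socket.IO JavaScript client listening for that event) could look like this:

import datetime

from flask import Flask
from flask_apscheduler import APScheduler
from flask_socketio import SocketIO

app = Flask(__name__)
socketio = SocketIO(app)


def my_job(text):
    payload = f"{text} {datetime.datetime.now()}"
    print(payload)
    socketio.emit('job_update', {'data': payload})   # push to all connected clients


if __name__ == "__main__":
    scheduler = APScheduler()
    scheduler.add_job(func=my_job, args=['job run'], trigger='interval', id='job', seconds=5)
    scheduler.start()
    socketio.run(app, port=8000)                     # use socketio.run instead of app.run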
I would recommend that you start a daemonized thread, import your application variable, and then use with app.app_context() in order to log to your console.
It's a little bit more fiddly, but it allows the work to run in a separate thread from the application.
I use this method to fire off a bunch of HTTP requests concurrently; the alternative is waiting for each response before making a new one.
I'm sure you've realised that the thread will stay occupied if you run an infinitely running command.
Make sure to daemonize the thread so that when you stop your web app it kills the thread gracefully at the same time.
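A minimal sketch of that approach, assuming the Flask instance lives in app.py and using an arbitrary 5-second sleep as the interval:

import threading
import time

from app import app                     # assumes your Flask instance is created in app.py


def do_periodic_work():
    while True:
        with app.app_context():
            app.logger.info("periodic job ran")   # placeholder for the real work
        time.sleep(5)


worker = threading.Thread(target=do_periodic_work, daemon=True)   # dies with the web app
worker.start()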

APScheduler resets after every deploy

I have a script which, when run, adds RSS feed parsing tasks to some Celery queues. I have now implemented apscheduler to run the script every 2 hours to get new data from the feeds.
My implementation looks like this:
#!/usr/bin/env python
import atexit
import logging
import os
from logging import getLogger

from apscheduler.schedulers.blocking import BlockingScheduler

logger = getLogger('scheduled_parser')
PARSER_SCHEDULER = 'parser_scheduler'


def main():
    scheduler = BlockingScheduler(job_defaults={'coalesce': True})
    scheduler.add_jobstore('sqlalchemy', alias='scheduler_config', url=os.environ.get("DATABASE_URL"))
    scheduler.add_job(run_parser, 'interval', seconds=int(os.environ.get("SCHEDULER_RUN_FREQUENCY")),
                      id=PARSER_SCHEDULER, replace_existing=True)
    scheduler.start()
    atexit.register(lambda: scheduler.shutdown())


def run_parser():
    < code to add items to queues >


if __name__ == "__main__":
    logging.basicConfig()
    logger.setLevel(logging.INFO)
    main()
My code is deployed on Heroku and I have the following in my Procfile:
clock: python scheduled_parser
<celery worker processes>
I am having the following issues:
1. I am storing the scheduler job in persistent storage and I can even see it in my db, but when I do scheduler.get_job(PARSER_SCHEDULER, 'scheduler_config') I get None.
2. Whenever I deploy to Heroku, I think the next run is being reset. For example, if the parser is set to run every 2 hours with the next run at 4:00pm, and I deploy to Heroku at 3:00pm, then the next run happens at 5:00pm instead of 4:00pm.
Not sure about your issue #1, but I think issue #2 is that on every deploy, this line is going to replace the job, thus resetting the schedule:
scheduler.add_job(run_parser, 'interval', seconds=int(os.environ.get("SCHEDULER_RUN_FREQUENCY")),
                  id=PARSER_SCHEDULER, replace_existing=True)
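One possible way around this (a sketch, not tested on the Heroku setup above; it swaps BlockingScheduler for BackgroundScheduler so the check can run after start()) is to start the scheduler first, so the SQLAlchemy job store is actually consulted, and only add the job when it is not already stored:

import os
import time

from apscheduler.schedulers.background import BackgroundScheduler

PARSER_SCHEDULER = 'parser_scheduler'


def main():
    scheduler = BackgroundScheduler(job_defaults={'coalesce': True})
    scheduler.add_jobstore('sqlalchemy', alias='scheduler_config', url=os.environ.get("DATABASE_URL"))
    scheduler.start()
    if scheduler.get_job(PARSER_SCHEDULER, 'scheduler_config') is None:
        # First deploy only: create the job. Later deploys leave the stored
        # job (and its next_run_time) untouched.
        scheduler.add_job(run_parser, 'interval',
                          seconds=int(os.environ.get("SCHEDULER_RUN_FREQUENCY")),
                          id=PARSER_SCHEDULER, jobstore='scheduler_config')
    try:
        while True:                      # keep the clock process alive
            time.sleep(60)
    except (KeyboardInterrupt, SystemExit):
        scheduler.shutdown()

Note that get_job() only consults the persistent job store once the scheduler is running, which may also be why the get_job() call in issue 1 returns None.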

Dispy, initiating a SharedJobCluster on a compute node

I am creating a compute cluster in Python using dispy. One of my use cases would be solved very nicely by starting a process on a compute node that itself starts a distributed process. As such, I have set up a SharedJobCluster on the primary scheduler, and also inside the function that is sent to the cluster (which should in turn start a series of distributed processes). However, when the second SharedJobCluster is initiated, the code hangs on that line and does not move past it (nor does it show any errors).
Minimal working example:
def clusterfun():
    import dispy
    import test2
    import logging

    log_filename = 'worker.log'
    logging.basicConfig(filename=log_filename,
                        level=logging.DEBUG,
                        format='%(asctime)s %(name)-12s %(levelname)-8s %(message)s',
                        datefmt='[%m-%d-%Y %H:%M:%S]')

    logging.info("Starting cluster...")
    # THE FOLLOWING LINE HANGS
    cluster = dispy.SharedJobCluster(test2.clusterfun2, port=0, scheduler_node='127.0.0.1')
    logging.info("Started cluster...")

    job = cluster.submit()
    logging.info("Submitted job...")
    return job()


if __name__ == '__main__':
    import dispy

    #
    # Start the compute cluster
    #
    cluster = dispy.SharedJobCluster(clusterfun, port=0, depends=['test2.py'], scheduler_node='127.0.0.1')
    job = cluster.submit()
    print(job())
test2.py contains:
def clusterfun2():
    return "Foo"
For reference, I am currently running dispyscheduler.py, dispynode, and this Python code all on the same machine. This setup works, except when trying to initiate the embedded distributed task.
The worker.log output contains "Starting cluster..." but nothing else.
If I check the status of the node, it says that it is running one job, but it never completes.
