Get number of active instances for BackgroundScheduler jobs - python-3.x

I have a simple BackgroundScheduler and a simple task. The BackgroundScheduler is configured to run only a single instance for that task:
from apscheduler.schedulers.background import BackgroundScheduler

scheduler = BackgroundScheduler()
scheduler.add_job(run_task, 'interval', seconds=10, max_instances=1)
scheduler.start()
When a task starts, it takes much more than 10 seconds to complete and I get the warning:
Execution of job "run_tasks (trigger: interval[0:00:10], next run at: 2020-06-17 18:25:32 BST)" skipped: maximum number of running instances reached (1)
This works as expected.
My problem is that I can't find a way to check if an instance of that task is currently running.
In the docs, there are many ways to get all and individual scheduled tasks, but I can't find a way to check if a task is currently running or not.
I would ideally want something like:
def job_in_progress():
    job = scheduler.get_job(job_id)
    instances = job.get_instances()
    return instances > 0
Any ideas?

Not great because you have to access a private attribute, but it's the only thing I could find:
def job_in_progress():
    job = scheduler.get_job(job_id)
    instances = scheduler._instances[job_id]
    return instances > 0
If someone has a better idea, don't use this.
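Another option that avoids the private attribute is to track running jobs yourself with scheduler event listeners; a rough sketch, assuming APScheduler 3.x (EVENT_JOB_SUBMITTED fires when a job is handed to its executor, EVENT_JOB_EXECUTED/EVENT_JOB_ERROR when it finishes):
from apscheduler.events import EVENT_JOB_SUBMITTED, EVENT_JOB_EXECUTED, EVENT_JOB_ERROR

running_jobs = set()  # ids of jobs currently executing

def _on_submitted(event):
    running_jobs.add(event.job_id)

def _on_done(event):
    running_jobs.discard(event.job_id)

scheduler.add_listener(_on_submitted, EVENT_JOB_SUBMITTED)
scheduler.add_listener(_on_done, EVENT_JOB_EXECUTED | EVENT_JOB_ERROR)

def job_in_progress():
    return job_id in running_jobs
Skipped runs (the max-instances warning above) fire EVENT_JOB_MAX_INSTANCES rather than EVENT_JOB_SUBMITTED, so they are not counted as running.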

Related

how to use Flask with multiprocessing

Concretely, I'm using Flask to process a request; the pseudocode looks like this:
from flask import Flask, request

app = Flask(__name__)

@app.route("/foo", methods=["POST"])
def foo():
    data = request.get_json()  # {"request_id": "abc", "data": "some text"}
    result_a = do_task_a(data)  # returns {"result_a": "a"}, takes about 1 second to finish
    result_b = do_task_b(data)  # returns {"result_b": "b"}, takes about 1 second to finish
    result_c = do_task_c(data)  # returns {"result_c": "c"}, takes about 1 second to finish
    result = {
        "result_a": result_a["result_a"],
        "result_b": result_b["result_b"],
        "result_c": result_c["result_c"]}
    return result

app.run(host='0.0.0.0', port=4000, threaded=False)
Here, do_task_a, do_task_b and do_task_c are completely independent subtasks. I know I can use multiprocessing.Process to create a process for each of these three subtasks and use join() to wait for them to finish, but I don't know whether creating processes for every request is the proper way.
Maybe I can use multiprocessing.Queue to help, but I haven't found a good way.
I searched for multiprocessing but couldn't figure out a good solution.
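For reference, a rough sketch of the Process-per-request idea described above, assuming do_task_a, do_task_b and do_task_c are picklable top-level functions and their one-key result dicts are collected through a multiprocessing.Queue:
import multiprocessing as mp

def _run(task, data, q):
    # run one subtask in its own process and push its result onto the queue
    q.put(task(data))

def run_in_processes(data):
    q = mp.Queue()
    tasks = [do_task_a, do_task_b, do_task_c]
    procs = [mp.Process(target=_run, args=(t, data, q)) for t in tasks]
    for p in procs:
        p.start()
    results = {}
    for _ in procs:
        results.update(q.get())  # each subtask returns a one-key dict
    for p in procs:
        p.join()
    return results
As the answer below notes, spawning processes on every request is expensive, so a shared pool is usually the better fit.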
I'm not a Python guy, but creating processes is indeed an expensive operation.
If it's possible, create threads instead; they're cheaper than processes.
If you serve the request many times you can do even better than that, because creating threads per request is still fairly expensive.
An even more advanced setup is a "pre-loaded" thread pool: N threads that you always keep in memory, ready to run arriving tasks.
In terms of a technical solution, I've found this article that explains how to create thread pools in Python 3.2+.
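A minimal sketch of that thread-pool idea with concurrent.futures, assuming do_task_a, do_task_b and do_task_c are thread-safe; the pool is created once and reused for every request:
from concurrent.futures import ThreadPoolExecutor
from flask import Flask, request, jsonify

app = Flask(__name__)
executor = ThreadPoolExecutor(max_workers=3)  # pre-loaded pool shared by all requests

@app.route("/foo", methods=["POST"])
def foo():
    data = request.get_json()
    # submit the three independent subtasks so they run concurrently
    fa = executor.submit(do_task_a, data)
    fb = executor.submit(do_task_b, data)
    fc = executor.submit(do_task_c, data)
    return jsonify({
        "result_a": fa.result()["result_a"],
        "result_b": fb.result()["result_b"],
        "result_c": fc.result()["result_c"],
    })
If the subtasks are I/O-bound this brings the response time close to the slowest of the three rather than their sum; if they are pure CPU-bound Python, threads won't help because of the GIL and a process pool is the better fit.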

Performance difference between multithread using queue and futures.ThreadPoolExecutor using list in python3?

I was trying various approaches with Python multi-threading to see which one fits my requirements. To give an overview, I have a bunch of items that I need to send to an API. Then, based on the response, some of the items will go to a database and all of the items will be logged; e.g., if the API returns success for an item, that item will only be logged, but when it returns failure, that item will be sent to the database for a future retry, along with being logged.
Now, based on the API response, I can separate out the success items from the failures and make a batch query with all the failure items, which will improve my database performance. To do that, I am accumulating all requests in one place and trying to perform multithreaded API calls (since this is an I/O-bound task, I'm not even considering multiprocessing), but at the same time I need to keep track of which response belongs to which request.
Coming to the actual question, I tried two different approaches which I thought would give nearly identical performance, but there turned out to be a huge difference.
To simulate the API call, I created an API on my localhost with a 500 ms sleep (for average processing time). Please note that I want to start logging and inserting into the database only after all API calls are complete.
Approach 1 (with threading.Thread and queue.Queue)
import requests
import datetime
import threading
import queue

def target(data_q):
    while not data_q.empty():
        data_q.get()
        response = requests.get("https://postman-echo.com/get?foo1=bar1&foo2=bar2")
        print(response.status_code)
        data_q.task_done()

if __name__ == "__main__":
    data_q = queue.Queue()
    for i in range(0, 20):
        data_q.put(i)
    start = datetime.datetime.now()
    num_thread = 5
    for _ in range(num_thread):
        worker = threading.Thread(target=target(data_q))
        worker.start()
    data_q.join()
    print('Time taken multi-threading: ' + str(datetime.datetime.now() - start))
I tried with 5, 10, 20 and 30 items, and the corresponding results are below:
Time taken multi-threading: 0:00:06.625710
Time taken multi-threading: 0:00:13.326969
Time taken multi-threading: 0:00:26.435534
Time taken multi-threading: 0:00:40.737406
What shocked me here is that I tried the same thing without multi-threading and got almost the same performance.
Then, after some googling around, I was introduced to the futures module.
Approach 2 (using concurrent.futures)
import requests
import datetime
import traceback
from concurrent import futures

def fetch_url(im_url):
    try:
        response = requests.get(im_url)
        return response.status_code
    except Exception as e:
        traceback.print_exc()

if __name__ == "__main__":
    data = []
    for i in range(0, 20):
        data.append(i)
    start = datetime.datetime.now()
    urls = ["https://postman-echo.com/get?foo1=bar1&foo2=bar2" + str(item) for item in data]
    with futures.ThreadPoolExecutor(max_workers=5) as executor:
        responses = executor.map(fetch_url, urls)
        for ret in responses:
            print(ret)
    print('Time taken future concurrent: ' + str(datetime.datetime.now() - start))
Again with 5, 10, 20 and 30 items, the corresponding results are below:
Time taken future concurrent: 0:00:01.276891
Time taken future concurrent: 0:00:02.635949
Time taken future concurrent: 0:00:05.073299
Time taken future concurrent: 0:00:07.296873
Now I've heard about asyncio, but I haven't used it yet. I've also read that it gives even better performance than futures.ThreadPoolExecutor().
Final question: if both approaches are using threads (or so I think), then why is there such a huge performance gap? Am I doing something terribly wrong? I looked around but was not able to find a satisfying answer. Any thoughts on this would be highly appreciated. Thanks for going through the question.
[Edit 1] The whole thing is running on Python 3.8.
[Edit 2] Updated the code examples and execution times. Now they should run on anyone's system.
The documentation of ThreadPoolExecutor explains in detail how many threads are started when the max_workers parameter is not given, as in your original example. The behaviour differs depending on the exact Python version, but the number of threads started is most probably more than 3, the number of threads used by the first version with a queue. You should use futures.ThreadPoolExecutor(max_workers=3) to compare the two approaches.
For the updated Approach 1, I suggest modifying the for loop a bit:
for _ in range(num_thread):
    target_to_run = target(data_q)
    print('target to run: {}'.format(target_to_run))
    worker = threading.Thread(target=target_to_run)
    worker.start()
The output will be like this:
200
...
200
200
target to run: None
target to run: None
target to run: None
target to run: None
target to run: None
Time taken multi-threading: 0:00:10.846368
The problem is that the Thread constructor expects a callable object (or None) as its target. You are not giving it a callable; rather, the whole queue is processed during the call target(data_q) in the main thread, and then 5 threads are started that do nothing because their target is None.
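For illustration, the loop with that fix applied: the function object is passed as target and the queue through args, so each of the 5 worker threads actually runs the loop itself:
for _ in range(num_thread):
    # hand the callable and its argument to the thread instead of calling it here
    worker = threading.Thread(target=target, args=(data_q,))
    worker.start()
data_q.join()
With this change the queue is drained by the worker threads concurrently and the timings should land close to the ThreadPoolExecutor numbers above. (The empty()/get() pattern inside target can still race when the queue runs dry; get_nowait() with queue.Empty handling makes it more robust.)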

Celery worker with multithreading - how to update results concurently

I created a Flask API with a Celery worker. The user presses a "start tests" button, which makes a POST request that returns a URL the user can poll every 5 seconds to get the results of the tests (needed to update a frontend progress bar). The Celery task uses threading. My goal is to update the Celery task state based on the results of the threads concurrently; I don't want to wait until all my threads have finished to return their results. My Celery task looks like this:
@celery.task(bind=True)  # bind argument instructs Celery to send a "self" argument and use it to record status updates
def run_tests(self, dialog_cases):
    """
    Testing running as a background task
    """
    results = []
    test_case_no = 1
    test_controller = TestController(dialog_cases)
    bot_config = [test_controller.url, test_controller.headers, test_controller.db_name]
    threads = []
    queue = Queue()
    start = time.perf_counter()
    threads_list = list()
    for test_case in test_controller.test_cases:
        t = Thread(target=queue.put({randint(0, 1000): TestCase(test_case, bot_config)}))
        t.start()
        threads_list.append(t)
    for t in threads_list:
        t.join()
    results_dict_list = [queue.get() for _ in range(len(test_controller.test_cases))]
    for result in results_dict_list:
        for key, value in result.items():
            cprint.info(f"{key}, {value.test_failed}")
Now, TestCase is an object that, on creation, runs a function that makes a few iterations and afterwards returns whether the test failed or passed. I have another Flask endpoint which returns the status of the task. The question is how to get the values returned by the threads as they become available, without having to wait until they are all finished. I tried Queue, but it can only return the results when everything is over.
You can simply use update_state to modify the state of the task from each of those threads, if that is what you want. Furthermore, you can create your own custom states. Since you want to know the result of each test the moment it is finished, it seems like a good idea to have a custom state for each test that you update from each thread during runtime.
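A rough sketch of that idea, reusing TestCase and test_failed from the question; run_one_test is a hypothetical helper that each Thread would run via target=run_one_test, args=(self, test_case, bot_config, lock, finished), where lock is a threading.Lock() and finished a shared list created inside run_tests:
def run_one_test(task, test_case, bot_config, lock, finished):
    # worker thread: run the test, then push a status update right away
    result = TestCase(test_case, bot_config)
    with lock:
        finished.append(result.test_failed)  # assumes test_failed is a boolean
        task.update_state(state='PROGRESS',
                          meta={'done': len(finished), 'failed': sum(finished)})
The polling endpoint can then read the task's AsyncResult.state and AsyncResult.info and feed the 'done' count to the progress bar.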
An alternative is to refactor your code so that each test is actually a Celery task. Then you can use the Chord or Group primitives to build your workflow. Since you want to know the state during runtime, Group is perhaps the better fit, because then you can monitor the state of the GroupResult object...
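A minimal sketch of the Group variant; run_single_test is a hypothetical per-test task, and the status endpoint polls the saved GroupResult:
from celery import group
from celery.result import GroupResult  # used by the status endpoint

@celery.task
def run_single_test(test_case, bot_config):
    # one Celery task per test case
    return TestCase(test_case, bot_config).test_failed

def start_tests(test_cases, bot_config):
    job = group(run_single_test.s(tc, bot_config) for tc in test_cases)
    group_result = job.apply_async()
    group_result.save()  # requires a result backend that supports saving group results
    return group_result.id

# status endpoint, roughly:
#   group_result = GroupResult.restore(group_id)
#   progress = group_result.completed_count() / len(group_result.results)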

Python Async Functionality

I'm trying to figure out how the async functionality works in Python. I have watched countless videos but I guess I'm not 'getting it'. My code looks as follows:
import asyncio

def run_watchers():
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    loop.run_until_complete(watcher_helper())
    loop.close()

async def watcher_helper():
    watchers = Watcher.objects.all()
    for watcher in watchers:
        print("Running watcher : " + str(watcher.pk))
        await watcher_helper2(watcher)

async def watcher_helper2(watcher):
    for i in range(1, 1000000):
        x = i * 1000 / 2000
What makes sense to me is to have three functions: one to start the loop, a second to iterate through the different options to execute, and a third to do the work.
I am expecting the following output:
Running watcher : 1
Running watcher : 2
...
...
Calculation done
Calculation done
...
...
however I am getting:
Running watcher : 1
Calculation done
Running watcher : 2
Calculation done
...
...
which obviously shows the calculations are not done in parallel. Any idea what I am doing wrong?
asyncio can only be used to speed up multiple network I/O-related functions (sending/receiving data over the internet). While you wait for some data from the network (which may take a long time) you usually idle. Using asyncio allows you to use this idle time for other useful work: for example, to start another, parallel network request.
asyncio can't somehow speed up a CPU-bound job (which is what watcher_helper2 does in your example). While you are multiplying some numbers there's simply no idle time that could be used to do something different and gain a benefit from that.
Read also this answer for a more detailed explanation.
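When the per-watcher work really is pure computation, a common pattern is to hand it to a process pool from the event loop; a minimal sketch, assuming the loop body of watcher_helper2 is factored out into a plain function crunch:
import asyncio
from concurrent.futures import ProcessPoolExecutor

def crunch(watcher_pk):
    # synchronous CPU-bound work, runs in a separate process
    for i in range(1, 1000000):
        x = i * 1000 / 2000
    return watcher_pk

async def watcher_helper():
    loop = asyncio.get_running_loop()
    watchers = Watcher.objects.all()
    with ProcessPoolExecutor() as pool:
        tasks = [loop.run_in_executor(pool, crunch, w.pk) for w in watchers]
        for pk in await asyncio.gather(*tasks):
            print("Watcher done : " + str(pk))
This gives real parallelism because the computation leaves the GIL-bound interpreter process; with plain await (or even threads) the calculations would still run one after another.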

Run test scripts in parallel in nGrinder

We are running performance tests with nGrinder. We have use cases where we want to run multiple test scripts in parallel.
On their website it is stated that one user can only run one test at a time, so we set up two users, but I see the same behavior: only one test script is running and the others are waiting in the READY state.
Is there any way in nGrinder to run multiple test scripts in parallel?
It's only possible to run multiple tests concurrently when the tests are submitted by different users and enough free agents are available to run them all.
I suspect you don't have enough agents to run both.
You can run many scripts using only one agent. I would divide agents based on transaction groups and not on scripts.
Inside Grinder there is parallel.py; I have only used this before to run scripts in parallel.
See this link: https://github.com/DealerDotCom/grinder/blob/master/grinder/examples/parallel.py
from net.grinder.script.Grinder import grinder

scripts = ["TestScript1", "TestScript2", "TestScript3"]

# Ensure modules are initialised in the process thread.
for script in scripts:
    exec("import %s" % script)

def createTestRunner(script):
    exec("x = %s.TestRunner()" % script)
    return x

class TestRunner:
    def __init__(self):
        tid = grinder.threadNumber
        if tid % 4 == 2:
            self.testRunner = createTestRunner(scripts[1])
        elif tid % 4 == 3:
            self.testRunner = createTestRunner(scripts[2])
        else:
            self.testRunner = createTestRunner(scripts[0])

    # This method is called for every run.
    def __call__(self):
        self.testRunner()
