Best practices to process 500k+ requests - multithreading

I finished my first Python RESTful API (with Flask-RESTPlus) a few days ago, and I wrote a small program to test it:
import csv
import json

import requests

# url and header are defined elsewhere in the real script
if __name__ == '__main__':
    with open('dataset.csv') as dataset:
        reader = csv.DictReader(dataset)
        nb_requests = 0
        for row in reader:
            data = json.dumps(row)
            nb_requests += 1
            requests.post(url=url, data=data, headers=header)
The problem is the following: I have quite a large CSV dataset to test with (500k+ lines), and I need to make a POST request to my API for each line in it.
As expected it's slow, because both programs are synchronous, and I was wondering what the best practices are to make it faster.
I have read about multithreading, multiprocessing, asyncio... but I don't actually know which would be the best solution to make the API and my testing program faster. Any suggestions?
Thanks for your insights!
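A minimal sketch of the thread-pool approach that the threads below converge on, applied to this test script (assuming the same dataset.csv, url and header as above; the worker count of 20 is an arbitrary starting point to tune):

import csv
import json
from concurrent.futures import ThreadPoolExecutor, as_completed

import requests

# url and header are assumed to be defined as in the original test script
session = requests.Session()  # reuses TCP connections across requests

def post_row(row):
    return session.post(url=url, data=json.dumps(row), headers=header)

if __name__ == '__main__':
    with open('dataset.csv') as dataset, ThreadPoolExecutor(max_workers=20) as executor:
        futures = [executor.submit(post_row, row) for row in csv.DictReader(dataset)]
        for future in as_completed(futures):
            future.result()  # re-raises any exception from a worker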

Related

how to use Flask with multiprocessing

Concretely, I'm using Flask to process a request, pseudocode like this:
from flask import Flask, request

app = Flask(__name__)

@app.route("/foo", methods=["POST"])
def foo():
    data = request.get_json()   # {"request_id": "abc", "data": "some text"}
    result_a = do_task_a(data)  # returns {"result_a": "a"}, maybe about 1 second to finish
    result_b = do_task_b(data)  # returns {"result_b": "b"}, maybe about 1 second to finish
    result_c = do_task_c(data)  # returns {"result_c": "c"}, maybe about 1 second to finish
    result = {
        "result_a": result_a["result_a"],
        "result_b": result_b["result_b"],
        "result_c": result_c["result_c"]}
    return result

app.run(host='0.0.0.0', port=4000, threaded=False)
Here do_task_a, do_task_b and do_task_c are completely independent subtasks. I know I can use multiprocessing.Process to create a process for each of the three subtasks and use join() to wait until they are done, but I don't know whether creating a Process for every request is the proper way.
Maybe multiprocessing.Queue could help, but I haven't found a good way to use it.
I have searched around for multiprocessing but can't figure out a good solution.
I'm not a Python guy, but creating processes is indeed an expensive operation.
If it's possible, create threads instead; they are cheaper than processes.
If you serve the request many times you can do even better than that, because creating threads per request is still quite expensive.
An even more advanced setup is a "pre-loaded" thread pool: N threads that you always keep in memory, ready to run arriving tasks.
In terms of a technical solution, I've found this article that explains how to create thread pools in Python 3.2+.
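To make that concrete, here is a rough sketch of a shared thread pool used inside the Flask view (assuming the do_task_* functions from the question are I/O-bound or otherwise release the GIL while they wait; the pool size of 8 is arbitrary):

from concurrent.futures import ThreadPoolExecutor

from flask import Flask, request

app = Flask(__name__)
executor = ThreadPoolExecutor(max_workers=8)  # created once, shared by all requests

@app.route("/foo", methods=["POST"])
def foo():
    data = request.get_json()
    # Run the three independent subtasks concurrently on the shared pool.
    future_a = executor.submit(do_task_a, data)
    future_b = executor.submit(do_task_b, data)
    future_c = executor.submit(do_task_c, data)
    return {
        "result_a": future_a.result()["result_a"],
        "result_b": future_b.result()["result_b"],
        "result_c": future_c.result()["result_c"],
    }

app.run(host='0.0.0.0', port=4000, threaded=True)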

Performance difference between multithread using queue and futures.ThreadPoolExecutor using list in python3?

I was trying various approaches with python multi-threading to see which one fits my requirements. To give an overview, I have a bunch of items that I need to send to an API. Then based on the response, some of the items will go to a database and all the items will be logged; e.g., for an item if the API returns success, that item will only be logged but when it returns failure, that item will be sent to database for future retry along with logging.
Now, based on the API response, I can separate out success items from failures and make a batch query with all failure items, which will improve my database performance. To do that, I am accumulating all requests in one place and trying to perform multithreaded API calls (since this is an I/O-bound task, I'm not even thinking about multiprocessing), but at the same time I need to keep track of which response belongs to which request.
Coming to the actual question, I tried two different approaches which I thought would give nearly identical performance, but there turned out to be a huge difference.
To simulate the API call, I created an API on my localhost with a 500 ms sleep (for average processing time). Please note that I want to start logging and inserting into the database only after all API calls are complete.
Approach 1 (with threading.Thread and queue.Queue)
import requests
import datetime
import threading
import queue

def target(data_q):
    while not data_q.empty():
        data_q.get()
        response = requests.get("https://postman-echo.com/get?foo1=bar1&foo2=bar2")
        print(response.status_code)
        data_q.task_done()

if __name__ == "__main__":
    data_q = queue.Queue()
    for i in range(0, 20):
        data_q.put(i)
    start = datetime.datetime.now()
    num_thread = 5
    for _ in range(num_thread):
        worker = threading.Thread(target=target(data_q))
        worker.start()
    data_q.join()
    print('Time taken multi-threading: ' + str(datetime.datetime.now() - start))
I tried with 5, 10, 20 and 30 items and the results are below, correspondingly:
Time taken multi-threading: 0:00:06.625710
Time taken multi-threading: 0:00:13.326969
Time taken multi-threading: 0:00:26.435534
Time taken multi-threading: 0:00:40.737406
What shocked me here is that I tried the same thing without multi-threading and got almost the same performance.
Then, after some googling around, I was introduced to the futures module.
Approach 2 (using concurrent.futures)
from concurrent import futures
import datetime
import traceback

import requests

def fetch_url(im_url):
    try:
        response = requests.get(im_url)
        return response.status_code
    except Exception as e:
        traceback.print_exc()

if __name__ == "__main__":
    data = []
    for i in range(0, 20):
        data.append(i)
    start = datetime.datetime.now()
    urls = ["https://postman-echo.com/get?foo1=bar1&foo2=bar2" + str(item) for item in data]
    with futures.ThreadPoolExecutor(max_workers=5) as executor:
        responses = executor.map(fetch_url, urls)
        for ret in responses:
            print(ret)
    print('Time taken future concurrent: ' + str(datetime.datetime.now() - start))
Again, with 5, 10, 20 and 30 items, the results are below, correspondingly:
Time taken future concurrent: 0:00:01.276891
Time taken future concurrent: 0:00:02.635949
Time taken future concurrent: 0:00:05.073299
Time taken future concurrent: 0:00:07.296873
Now I've heard about asyncio, but I've not used it yet. I've also read that it gives even better performance than futures.ThreadPoolExecutor().
Final question: if both approaches are using threads (or so I think), then why is there such a huge performance gap? Am I doing something terribly wrong? I looked around but was not able to find a satisfying answer. Any thoughts on this would be highly appreciated. Thanks for going through the question.
[Edit 1] The whole thing is running on Python 3.8.
[Edit 2] Updated code examples and execution times. Now they should run on anyone's system.
The documentation of ThreadPoolExecutor explains in detail how many threads are started when the max_workers parameter is not given. The behaviour differs between Python versions, but the number of threads started is most probably higher than the number of worker threads in the first version using a queue. To compare the two approaches fairly, give futures.ThreadPoolExecutor the same max_workers as the number of queue worker threads (5 in the updated code).
For the updated Approach 1, I suggest modifying the for loop a bit:
for _ in range(num_thread):
    target_to_run = target(data_q)
    print('target to run: {}'.format(target_to_run))
    worker = threading.Thread(target=target_to_run)
    worker.start()
The output will be like this:
200
...
200
200
target to run: None
target to run: None
target to run: None
target to run: None
target to run: None
Time taken multi-threading: 0:00:10.846368
The problem is that the Thread constructor expects a callable object (or None) as its target. You are not giving it a callable: the queue is actually drained by the first invocation of target(data_q) in the main thread, and the 5 threads that get started do nothing because their target is None.
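The fix, sketched below against the question's Approach 1, is to pass the function itself together with its arguments, so that each worker thread runs target concurrently (the rest of the script stays unchanged):

workers = []
for _ in range(num_thread):
    # Pass the callable and its arguments separately; do not call target() here.
    worker = threading.Thread(target=target, args=(data_q,))
    worker.start()
    workers.append(worker)
data_q.join()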

How to get more events running with asyncio

I am currently learning asynchronous programming and wrote this program that sends requests asynchronously with Python 3 asyncio.
When running it, my program is not that fast, and I was trying to figure out how to do better.
To find the number of events running, I thought about checking the kernel task thread count in Activity Monitor. It appears I am only running 222 threads, for a total of 2% of the CPU.
Is there a way to max out the thread count?
Can I make it faster with cleaner code? As seen below, my code works but is kind of hacky.
import asyncio
import requests

# make_request, list and end are defined elsewhere in the real script
@asyncio.coroutine
def main():
    loop = asyncio.get_event_loop()
    for i, _ in enumerate(list):
        # schedule the blocking requests call on the default thread-pool executor
        f = loop.run_in_executor(None, make_request)
        if i == end:
            response = yield from f

loop = asyncio.get_event_loop()
loop.run_until_complete(main())
Thank you.
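One common way to get many requests in flight with asyncio while still using requests is to schedule every blocking call on the loop's executor and gather the futures. A rough sketch (the URL list and make_request here are stand-ins, not the asker's actual code):

import asyncio
import requests

URLS = ["https://postman-echo.com/get?foo=%d" % i for i in range(20)]  # stand-in URLs

def make_request(url):
    # Blocking call; it runs inside the executor's worker threads.
    return requests.get(url).status_code

async def main():
    loop = asyncio.get_running_loop()
    # Schedule all blocking calls at once instead of awaiting them one by one.
    futures = [loop.run_in_executor(None, make_request, url) for url in URLS]
    results = await asyncio.gather(*futures)
    print(results)

asyncio.run(main())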

Threads will not close off after program completion

I have a script that receives temperature data via requests. Since I had to make multiple requests (around 13,000) I decided to explore multi-threading, which I am new at.
The program works by grabbing longitude/latitude data from a CSV file and then making a request to retrieve the temperature data.
The problem I am facing is that the script does not finish fully when the last temperature value is retrieved.
Here is the code. I have shortened it so it is easy to see what I am doing:
num_threads = 16
q = Queue(maxsize=0)

def get_temp(q):
    while not q.empty():
        work = q.get()
        if work is None:
            break
        ## rest of my code here
        q.task_done()
At main:
def main():
    for o in range(num_threads):
        logging.debug('Starting Thread %s', o)
        worker = threading.Thread(target=get_temp, args=(q,))
        worker.setDaemon(True)
        worker.start()
    logging.info("Main Thread Waiting")
    q.join()
    logging.info("Job complete!")
I do not see any errors on the console, and the temperatures are successfully written to another file. I have tried running a test CSV file with only a few longitude/latitude references and the script seems to finish executing fine.
So is there a way of shedding light on what might be happening in the background? I am using Python 3.7.3 with PyCharm 2019.1 on Linux Mint 19.1.
q.join() does not wait for the threads themselves; it blocks until every item that was put on the queue has been marked as processed with q.task_done(). If a worker gets stuck on an item, or raises before reaching task_done(), the unfinished-task count never drops to zero and the main thread waits on q.join() forever.
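A defensive version of the worker (a sketch only; the temperature request below is a placeholder for the code the asker shortened out) puts a timeout on the HTTP call and guarantees that task_done() always runs:

import logging
import queue
import requests

def get_temp(q):
    while True:
        try:
            work = q.get_nowait()  # an empty queue raises queue.Empty and the thread exits
        except queue.Empty:
            break
        try:
            # Placeholder for the real temperature request from the question.
            requests.get("https://example.com/temperature", params={"loc": work}, timeout=10)
        except requests.RequestException:
            logging.exception("Request failed for %s", work)
        finally:
            q.task_done()  # always mark the item as done, even on failure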

Read a file at a fixed interval using apscheduler [duplicate]

I want to read a file line by line and output each line at a fixed interval.
The purpose of the script is to replay some GPS log files whilst updating the time/date fields, as the software I'm testing rejects messages if they are too far out from the system time.
I'm attempting to use apscheduler for this as I wanted the output rate to be as close to 10 Hz as reasonably possible, and this didn't seem achievable with simple sleep commands.
I'm new to Python, so I can get a little stuck on the scope of variables with tasks like this. The closest I've come to making this work is by just reading lines from the file object in my scheduled function.
import sys, os
from apscheduler.schedulers.blocking import BlockingScheduler

def emitRMC():
    line = route.readline()
    if line == b'':
        route.seek(0)
        line = route.readline()
    print(line)

if __name__ == '__main__':
    route = open("testfile.txt", "rb")
    scheduler = BlockingScheduler()
    scheduler.add_executor('processpool')
    scheduler.add_job(emitRMC, 'interval', seconds=0.1)
    scheduler.start()
However, this doesn't really feel like the correct way of proceeding, and I'm also seeing each input line repeated ten times at the output, which I can't explain.
The repetition seemed very consistent, and I thought it might be due to max_workers, although I've since set that to 1 without any impact.
I'm also outputting at a 10 Hz interval, and the 10x repetition felt like it could be more than a coincidence.
Usually when I get stuck like this it means I'm heading off in the wrong direction and need pointing to a smarter approach, so all advice is welcome.
I found a simple solution here: Executing periodic actions in Python, with this code from Michael Anderson, which works in a single thread.
import datetime, threading, time

def foo():
    next_call = time.time()
    while True:
        print(datetime.datetime.now())
        next_call = next_call + 1
        time.sleep(next_call - time.time())

timerThread = threading.Thread(target=foo)
timerThread.start()
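Adapted to the replay use case (a sketch under the question's assumptions: the log lines live in testfile.txt, the file wraps around at EOF as in the original emitRMC, and rewriting the date/time fields is left as a placeholder), the same drift-free pattern can emit one line every 0.1 s without apscheduler:

import threading
import time

def replay(path, interval=0.1):
    next_call = time.time()
    with open(path, "rb") as route:
        while True:
            line = route.readline()
            if line == b'':  # wrap around at end of file, as in emitRMC
                route.seek(0)
                line = route.readline()
            # Placeholder: update the date/time fields in `line` here before emitting.
            print(line)
            next_call += interval
            time.sleep(max(0.0, next_call - time.time()))

timerThread = threading.Thread(target=replay, args=("testfile.txt",))
timerThread.start()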

Resources