Multi Thread Requests Python3 - python-3.x

I have researched this topic a lot, but I can't figure out how to send POST requests from multiple threads using Python 3.
names = ["dfg", "dddfg", "qwed"]
for name in names:
    res = requests.post(url, data=name)
    res.text
Here I want to send all these names and I want to use multi threading to make it faster.

Solution 1 - concurrent.futures.ThreadPoolExecutor fixed number of threads
Using a custom function (request_post) you can do almost anything.
import concurrent.futures  # a bare "import concurrent" does not load the futures submodule
import requests

def request_post(url, data):
    return requests.post(url, data=data)

with concurrent.futures.ThreadPoolExecutor() as executor:  # default number of threads; pass max_workers=N to set it
    res = [executor.submit(request_post, url, data) for data in names]
    concurrent.futures.wait(res)
res will be a list of Future instances, one per request, each wrapping a requests.Response. To access a response, call res[index].result(), where index ranges from 0 to len(names) - 1.
Future objects give you finer control over the responses received: you can check whether a request completed correctly, raised an exception, timed out, and so on. See the concurrent.futures documentation for more.
You also avoid the problems that come with a high number of threads (see Solution 2).
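As a rough sketch of the control that Future objects give, the snippet below collects results and exceptions separately as each request finishes. To stay runnable without a network, fake_post is a hypothetical stand-in for requests.post(url, data=name), and the "bad" name that triggers a failure is invented for illustration.

```python
import concurrent.futures
import time

def fake_post(name):
    """Hypothetical stand-in for requests.post(url, data=name)."""
    time.sleep(0.05)  # simulate network latency
    if name == "bad":
        raise ValueError("simulated failure for " + name)
    return "ok:" + name

def post_all(names, max_workers=3):
    """Send every name concurrently; return (results, errors) keyed by name."""
    results, errors = {}, {}
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each Future back to the name it was submitted with.
        future_to_name = {executor.submit(fake_post, n): n for n in names}
        for future in concurrent.futures.as_completed(future_to_name):
            name = future_to_name[future]
            try:
                results[name] = future.result()
            except Exception as exc:  # the exception raised inside the worker
                errors[name] = exc
    return results, errors

if __name__ == "__main__":
    results, errors = post_all(["dfg", "dddfg", "qwed", "bad"])
    print(results)
    print(errors)
```

Keeping the future-to-name mapping is one way to know which response belongs to which request, since as_completed yields futures in completion order rather than submission order.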
Solution 2 - multiprocessing.dummy.Pool and spawn one thread for each request
Might be useful if you are not requesting a lot of pages, or if the response time is quite slow.
from multiprocessing.dummy import Pool as ThreadPool
import itertools
import requests

with ThreadPool(len(names)) as pool:  # creates a Pool of 3 threads
    # starmap takes the function itself plus an iterable of argument tuples
    res = pool.starmap(requests.post, zip(itertools.repeat(url), names))
pool.starmap is used to pass (map) multiple arguments to one function (requests.post) that is going to be called by the threads of the pool (ThreadPool). It returns a list with one requests.Response per request made.
itertools.repeat(url) is needed so that the first argument is repeated for every call; zip pairs it with each entry of names.
names supplies the second positional argument of requests.post (data), so it works without explicitly using the keyword data. Its length equals the number of threads being created.
This code will not work if you need to pass another parameter by keyword, such as an optional one, because starmap only passes positional arguments.
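One possible way around the keyword-argument limitation is to bind the fixed arguments with functools.partial and then use pool.map. The sketch below is only an illustration: fake_post is a hypothetical stand-in that echoes its arguments, and the timeout value is an assumption, not something from the original code.

```python
from functools import partial
from multiprocessing.dummy import Pool as ThreadPool

def fake_post(url, data=None, timeout=None):
    """Hypothetical stand-in for requests.post; echoes what it received."""
    return (url, data, timeout)

def post_names(url, names, timeout=5):
    # Bind url positionally and timeout by keyword; each mapped item
    # then fills the next positional slot, which is `data`.
    call = partial(fake_post, url, timeout=timeout)
    with ThreadPool(len(names)) as pool:  # names must not be empty
        return pool.map(call, names)

if __name__ == "__main__":
    print(post_names("https://example.invalid", ["dfg", "dddfg", "qwed"]))
```

With the real library the same shape would presumably be partial(requests.post, url, timeout=5), since data is also the second positional parameter there.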

Related

how to use Flask with multiprocessing

Concretely, I'm using Flask to process a request, pseudocode like this:
from flask import Flask, request

app = Flask(__name__)

@app.route("/foo", methods=["POST"])
def foo():
    data = request.get_json()  # {"request_id": "abc", "data": "some text"}
    result_a = do_task_a(data)  # returns {"result_a": "a"}, maybe about 1 second to finish
    result_b = do_task_b(data)  # returns {"result_b": "b"}, maybe about 1 second to finish
    result_c = do_task_c(data)  # returns {"result_c": "c"}, maybe about 1 second to finish
    result = {
        "result_a": result_a["result_a"],
        "result_b": result_b["result_b"],
        "result_c": result_c["result_c"]}
    return result

app.run(host='0.0.0.0', port=4000, threaded=False)
Here, do_task_a, do_task_b and do_task_c are completely independent subtasks. I know I can use multiprocessing.Process to create processes for these three subtasks and use join() to wait for them to finish, but I don't know whether creating a Process for every request is the proper way.
Maybe multiprocessing.Queue could help, but I haven't found a good way to use it.
I searched for multiprocessing solutions but couldn't figure out a good one.
I'm not a Python guy, but creating processes is indeed an expensive operation.
If possible, create threads; they're cheaper than processes.
If you handle many requests, you can do even better than that, because creating threads per request is still quite expensive.
An even more advanced setup is to create a "pre-loaded" thread pool: N threads that you always keep in memory, ready to run arriving tasks.
As for a technical solution, I've found this article that explains how to create thread pools in Python 3.2+.
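A minimal sketch of the "pre-loaded" pool idea, using concurrent.futures.ThreadPoolExecutor from the standard library. The pool is created once and reused by every call to handle(), which stands in for the body of the Flask view. The three task functions, their sleep times, and the pool size of 6 are assumptions made for illustration; threads only help here if the real tasks spend their time waiting on I/O rather than on CPU-bound Python code.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Created once at import time and shared by all requests.
executor = ThreadPoolExecutor(max_workers=6)

def do_task_a(data):
    time.sleep(0.2)  # hypothetical slow subtask
    return {"result_a": "a"}

def do_task_b(data):
    time.sleep(0.2)
    return {"result_b": "b"}

def do_task_c(data):
    time.sleep(0.2)
    return {"result_c": "c"}

def handle(data):
    """Run the three independent subtasks concurrently and merge the results."""
    futures = [executor.submit(task, data)
               for task in (do_task_a, do_task_b, do_task_c)]
    result = {}
    for future in futures:
        result.update(future.result())  # blocks until that subtask is done
    return result

if __name__ == "__main__":
    start = time.perf_counter()
    print(handle({"request_id": "abc", "data": "some text"}))
    print("elapsed: %.2fs" % (time.perf_counter() - start))
```

Inside the view, the call would then look like return handle(request.get_json()).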

how limit request per second with httpx [Python 3.6]

My project consists of consuming an api that is built on top of the aws lambda service. Technically, the leader who built the api tells me that there is no fixed request limit since the service is elastic, but it is important to take into account the number of requests per second that the api can support.
To control the number of concurrent requests, the Python script I am developing uses asyncio and httpx to consume the API concurrently, and I am using the max_connections parameter of httpx.Limits to find the optimal value so that the API does not freeze.
My problem is that I don't know whether I am misinterpreting the max_connections parameter: when testing with a value of 1000, my understanding is that I am making 1000 concurrent requests per second to the API, yet the API still freezes after a certain time.
I would like to be able to control the limit of requests per second without the need to use third-party libraries.
How could I do it?
Here is my MWE
import asyncio
import json

import httpx

# reg must come before endpoint: a parameter without a default cannot follow one with a default
async def consume(client, reg, endpoint: str = '/create'):
    data = {"param1": reg[1]}
    response = await client.post(url=endpoint, data=json.dumps(data))
    return response.json()

async def run(self, regs):
    # Empty list to consolidate all responses
    results = []
    # httpx limits configuration
    limits = httpx.Limits(max_keepalive_connections=None, max_connections=1000)
    timeout = httpx.Timeout(connect=60.0, read=30.0, write=30.0, pool=60.0)
    # httpx client context
    async with httpx.AsyncClient(base_url='https://apiexample', headers={'Content-Type': 'application/json'},
                                 limits=limits, timeout=timeout) as client:
        # regs is a list of more than 1000000 tuples
        tasks = [asyncio.ensure_future(consume(client=client, reg=reg))
                 for reg in regs]
        result = await asyncio.gather(*tasks)
        results += result
    return results
Thanks in advance.
Your leader is wrong: there is a request limit for AWS Lambda (1000 concurrent executions by default).
The AWS API is highly unlikely to "freeze" (there are many layers of protection), so I would look for a problem on your side.
Start debugging by lowering the concurrent connections setting (e.g. to 100), and explore other settings if this doesn't fix the issue.
More info: https://www.bluematador.com/blog/why-aws-lambda-throttles-functions
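On the original question of limiting requests per second without third-party libraries: as far as I understand, max_connections caps how many connections are open at once, not how many requests start per second. A minimal sketch of a standard-library limiter that spaces out request starts is shown below. RateLimiter, fake_consume and run_all are hypothetical names, and the HTTP call is replaced by asyncio.sleep so that the example runs without a network.

```python
import asyncio
import time

class RateLimiter:
    """Space out calls so that at most `rate` of them start per second."""

    def __init__(self, rate):
        self._interval = 1.0 / rate
        self._next_slot = 0.0

    async def wait(self):
        now = time.monotonic()
        slot = max(now, self._next_slot)  # earliest free start time
        self._next_slot = slot + self._interval
        if slot > now:
            await asyncio.sleep(slot - now)

async def fake_consume(limiter, reg):
    """Hypothetical stand-in for consume(); no network involved."""
    await limiter.wait()
    await asyncio.sleep(0.01)  # simulated request latency
    return reg

async def run_all(regs, rate):
    limiter = RateLimiter(rate)
    return await asyncio.gather(*(fake_consume(limiter, reg) for reg in regs))

def main(regs, rate):
    loop = asyncio.new_event_loop()  # also works on Python 3.6, unlike asyncio.run
    try:
        return loop.run_until_complete(run_all(regs, rate))
    finally:
        loop.close()

if __name__ == "__main__":
    start = time.monotonic()
    print(main(list(range(10)), rate=50))
    print("elapsed: %.2fs" % (time.monotonic() - start))
```

In the real script, await limiter.wait() would go at the top of consume(), before client.post. For a list of a million entries it may also be worth submitting the work in batches instead of creating every task up front.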

Performance difference between multithread using queue and futures.ThreadPoolExecutor using list in python3?

I was trying various approaches with python multi-threading to see which one fits my requirements. To give an overview, I have a bunch of items that I need to send to an API. Then based on the response, some of the items will go to a database and all the items will be logged; e.g., for an item if the API returns success, that item will only be logged but when it returns failure, that item will be sent to database for future retry along with logging.
Now based on the API response I can separate out success items from failure and make a batch query with all failure items, which will improve my database performance. To do that, I am accumulating all requests at one place and trying to perform multithreaded API calls(since this is an IO bound task, I'm not even thinking about multiprocessing) but at the same time I need to keep track of which response belongs to which request.
Coming to the actual question, I tried two different approaches which I thought would give nearly identical performance, but there turned out to be a huge difference.
To simulate the API call, I created an API in my localhost with a 500ms sleep(for avg processing time). Please note that I want to start logging and inserting to database after all API calls are complete.
Approach - 1(With threading.Thread and queue.Queue())
import requests
import datetime
import threading
import queue

def target(data_q):
    while not data_q.empty():
        data_q.get()
        response = requests.get("https://postman-echo.com/get?foo1=bar1&foo2=bar2")
        print(response.status_code)
        data_q.task_done()

if __name__ == "__main__":
    data_q = queue.Queue()
    for i in range(0, 20):
        data_q.put(i)
    start = datetime.datetime.now()
    num_thread = 5
    for _ in range(num_thread):
        worker = threading.Thread(target=target(data_q))
        worker.start()
    data_q.join()
    print('Time taken multi-threading: '+str(datetime.datetime.now() - start))
I tried with 5, 10, 20 and 30 times and the results are below correspondingly,
Time taken multi-threading: 0:00:06.625710
Time taken multi-threading: 0:00:13.326969
Time taken multi-threading: 0:00:26.435534
Time taken multi-threading: 0:00:40.737406
What shocked me here is, I tried the same without multi-threading and got almost same performance.
Then after some googling around, I was introduced to futures module.
Approach - 2(Using concurrent.futures)
import datetime
import traceback
from concurrent import futures

import requests

def fetch_url(im_url):
    try:
        response = requests.get(im_url)
        return response.status_code
    except Exception as e:
        traceback.print_exc()

if __name__ == "__main__":
    data = []
    for i in range(0, 20):
        data.append(i)
    start = datetime.datetime.now()
    urls = ["https://postman-echo.com/get?foo1=bar1&foo2=bar2" + str(item) for item in data]
    with futures.ThreadPoolExecutor(max_workers=5) as executor:
        responses = executor.map(fetch_url, urls)
        for ret in responses:
            print(ret)
    print('Time taken future concurrent: ' + str(datetime.datetime.now() - start))
Again with 5, 10, 20 and 30 times and the results are below correspondingly,
Time taken future concurrent: 0:00:01.276891
Time taken future concurrent: 0:00:02.635949
Time taken future concurrent: 0:00:05.073299
Time taken future concurrent: 0:00:07.296873
Now I've heard about asyncio, but I've not used it yet. I've also read that it gives even better performance than futures.ThreadPoolExecutor().
Final question, If both approaches are using threads(or so I think) then why there is a huge performance gap? Am I doing something terribly wrong? I looked around. Was not able to find a satisfying answer. Any thoughts on this would be highly appreciated. Thanks for going through the question.
[Edit 1]The whole thing is running on python 3.8.
[Edit 2] Updated code examples and execution times. Now they should run on anyone's system.
The documentation of ThreadPoolExecutor explains in detail how many threads are started when the max_workers parameter is not given, as in your example. The behaviour is different depending on the exact Python version, but the number of tasks started is most probably more than 3, the number of threads in the first version using a queue. You should use futures.ThreadPoolExecutor(max_workers= 3) to compare the two approaches.
For the updated Approach - 1, I suggest modifying the for loop a bit:
for _ in range(num_thread):
    target_to_run = target(data_q)
    print('target to run: {}'.format(target_to_run))
    worker = threading.Thread(target=target_to_run)
    worker.start()
The output will be like this:
200
...
200
200
target to run: None
target to run: None
target to run: None
target to run: None
target to run: None
Time taken multi-threading: 0:00:10.846368
The problem is that the Thread constructor expects a callable object or None as its target. You do not give it a callable, rather queue processing happens on the first invocation of target(data_q) by the main thread, and 5 threads are started that do nothing because their target is None.
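A possible corrected version of Approach - 1 passes the function itself as target and supplies its arguments through args, so the work really runs in the worker threads. In this sketch the API call is replaced by time.sleep (an assumption, to keep it runnable offline), the worker function and results list are illustrative names, and get_nowait() replaces the empty()/get() pair to avoid a race between threads.

```python
import queue
import threading
import time

def worker(data_q, results):
    """Drain the queue; each item stands for one API call."""
    while True:
        try:
            item = data_q.get_nowait()
        except queue.Empty:
            return  # nothing left to do
        time.sleep(0.05)  # hypothetical stand-in for requests.get(...)
        results.append(item)  # list.append is thread-safe in CPython
        data_q.task_done()

def run(num_items=20, num_threads=5):
    data_q = queue.Queue()
    for i in range(num_items):
        data_q.put(i)
    results = []
    threads = [
        # Pass the callable and its arguments; do not call it here.
        threading.Thread(target=worker, args=(data_q, results))
        for _ in range(num_threads)
    ]
    for t in threads:
        t.start()
    data_q.join()  # wait until every item has been marked done
    for t in threads:
        t.join()
    return results

if __name__ == "__main__":
    start = time.perf_counter()
    print(sorted(run()))
    print("elapsed: %.2fs" % (time.perf_counter() - start))
```

With 20 items of about 0.05 s each and 5 threads, the run should take roughly a fifth of the sequential time.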

Cherrypy_handling requests

I've been searching for a while now but can't find an answer.
I know that cherrypy creates a new thread for handling requests (GET, PUT, POST, DELETE etc).
Now I fetch the parameters like this:
...
@cherrypy.tools.json_in()
@cherrypy.tools.json_out()
def POST(self):
    Forum.lock_post.acquire()
    conn = self.io.psqlConnect(self.dict_psql)
    cur = conn.cursor(cursor_factory = psycopg2.extras.RealDictCursor)
    params = cherrypy.request.json
    ...
    return some_dict
As you can see, I'm acquiring a lock to avoid a race condition on the variable params. But is this really necessary? I'm asking because if I do it like this, all other POST requests will have to wait. Is there a better solution that doesn't lock the whole POST? I'm using params several times throughout the code.
First, a clarification: CherryPy doesn't create a new thread for each request. It has a predetermined pool of threads (10 by default), and each thread handles a single request at a time.
As for whether you should lock cherrypy.request.json: you really don't need to. There is a concept called "thread locals", with which the same name can refer to different objects depending on which thread is accessing it (python docs).
Having said that, you should make sure that the code you write doesn't interfere with the state of the other threads (you can use cherrypy.thread_data as a quick fix).
Take a look at the CherryPy plugin architecture; if you want a resource to be shared among threads, a plugin is usually the way to go: http://docs.cherrypy.org/en/latest/extend.html#plugins
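The "thread locals" idea can be illustrated with threading.local from the standard library. This is only a sketch of the concept, not CherryPy's actual implementation: the request object and the handle function below are hypothetical, with request playing the role of cherrypy.request.

```python
import threading

request = threading.local()  # plays the role of cherrypy.request in this sketch

def handle(payload, seen, barrier):
    request.json = payload  # each thread sets its own attribute
    barrier.wait()          # every thread has written before any thread reads
    seen[payload["id"]] = request.json  # still this thread's own value

def demo(n=4):
    seen = {}
    barrier = threading.Barrier(n)
    threads = [
        threading.Thread(target=handle, args=({"id": i}, seen, barrier))
        for i in range(n)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen

if __name__ == "__main__":
    print(demo())
```

Even though all threads assign to the same name without a lock, each one reads back the payload it stored itself.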

Should I use coroutines or another scheduling object here?

I currently have code in the form of a generator which calls an IO-bound task. The generator actually calls sub-generators as well, so a more general solution would be appreciated.
Something like the following:
def processed_values(list_of_io_tasks):
    for task in list_of_io_tasks:
        value = slow_io_call(task)
        yield postprocess(value)  # in real version, would iterate over
                                  # processed_values2(value) here
I have complete control over slow_io_call, and I don't care in which order I get the items from processed_values. Is there something like coroutines I can use to get the yielded results in the fastest order by turning slow_io_call into an asynchronous function and using whichever call returns fastest? I expect list_of_io_tasks to be at least thousands of entries long. I've never done any parallel work other than with explicit threading, and in particular I've never used the various forms of lightweight threading which are available.
I need to use the standard CPython implementation, and I'm running on Linux.
Sounds like you are in search of multiprocessing.Pool(), specifically the Pool.imap_unordered() method.
Here is a port of your function to use imap_unordered() to parallelize calls to slow_io_call().
import multiprocessing

def processed_values(list_of_io_tasks):
    pool = multiprocessing.Pool(4)  # num workers
    results = pool.imap_unordered(slow_io_call, list_of_io_tasks)
    while True:
        try:
            value = results.next(9999999)  # large time-out
        except StopIteration:
            return  # no more results; ends the generator cleanly
        yield value
Note that you could also iterate over results directly (i.e. for item in results: yield item) without a while True loop; however, calling results.next() with a time-out value works around this multiprocessing keyboard interrupt bug and allows you to kill the main process and all subprocesses with Ctrl-C. Also note that results.next() raises StopIteration when it has no more items to return. On Python 3.7 and later (PEP 479), a StopIteration that escapes a generator body is converted into a RuntimeError, so the function catches it and returns, which ends the generator normally.
To use threads in place of processes, replace
import multiprocessing
with
import multiprocessing.dummy as multiprocessing
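For an I/O-bound workload, the thread-backed variant could look roughly like the sketch below, which also puts the postprocess step from the question back in. slow_io_call and postprocess are hypothetical stand-ins here (a sleep and a multiplication), and the worker count of 4 is arbitrary.

```python
import time
from multiprocessing.dummy import Pool  # thread-backed Pool with the same API

def slow_io_call(task):
    """Hypothetical I/O call: larger task numbers finish sooner here."""
    time.sleep(0.05 * (3 - task))
    return task

def postprocess(value):
    """Hypothetical post-processing step."""
    return value * 10

def processed_values(list_of_io_tasks, workers=4):
    with Pool(workers) as pool:
        # Results are yielded in completion order, not submission order.
        for value in pool.imap_unordered(slow_io_call, list_of_io_tasks):
            yield postprocess(value)

if __name__ == "__main__":
    print(list(processed_values([0, 1, 2])))
```

Because the tasks that sleep least complete first, the values usually come back in a different order from the input list.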

Resources