Multithreaded HTTP GET requests slow down badly after ~900 downloads

I'm attempting to download around 3,000 files (each being maybe 3 MB in size) from Amazon S3 using requests_futures, but the download slows down badly after about 900, and actually starts to run slower than a basic for-loop.
It doesn't appear that I'm running out of memory or CPU bandwidth. It does, however, seem like the Wi-Fi connection on my machine slows to almost nothing: I drop from a few thousand packets/sec to just 3-4. The weirdest part is that I can't load any websites until the Python process exits and I restart my Wi-Fi adapter.
What in the world could be causing this, and how can I go about debugging it?
If it helps, here's my Python code:
import requests
from requests_futures.sessions import FuturesSession
from concurrent.futures import ThreadPoolExecutor, as_completed
# get a nice progress bar
from tqdm import tqdm

def download_threaded(urls, thread_pool, session):
    futures_session = FuturesSession(executor=thread_pool, session=session)
    futures_mapping = {}
    for i, url in enumerate(urls):
        future = futures_session.get(url)
        futures_mapping[future] = i

    results = [None] * len(futures_mapping)

    with tqdm(total=len(futures_mapping), desc="Downloading") as progress:
        for future in as_completed(futures_mapping):
            try:
                response = future.result()
                result = response.text
            except Exception as e:
                result = e
            i = futures_mapping[future]
            results[i] = result
            progress.update()

    return results

s3_paths = []  # some big list of file paths on Amazon S3

def make_s3_url(path):
    return "https://{}.s3.amazonaws.com/{}".format(BUCKET_NAME, path)

urls = map(make_s3_url, s3_paths)

with ThreadPoolExecutor() as thread_pool:
    with requests.session() as session:
        results = download_threaded(urls, thread_pool, session)
Edit with various things I've tried:
time.sleep(0.25) after every future.result() (performance degrades sharply around 900)
4 threads instead of the default 20 (performance degrades more gradually, but still degrades to basically nothing)
1 thread (performance degrades sharply around 900, but recovers intermittently)
ProcessPoolExecutor instead of ThreadPoolExecutor (performance degrades sharply around 900)
calling raise_for_status() to raise an exception whenever the status indicates an error, then catching the exception and printing it as a warning (no warnings appear)
use ethernet instead of wifi, on a totally different network (no change)
creating futures in a normal requests session instead of using a FuturesSession (this is what I did originally, and I found requests_futures while trying to fix the issue)
running the download on only a narrow range of files around the failure point (e.g. file 850 through file 950) -- performance is just fine here, print(response.status_code) shows 200 all the way, and no exceptions are caught.
For what it's worth, I have previously been able to download ~1500 files from S3 in about 4 seconds using a similar method, albeit with files an order of magnitude smaller
Things I will try when I have time today:
Using a for-loop
Using Curl in the shell
Using Curl + Parallel in the shell
Using urllib2
Edit: it looks like the number of threads is stable, but when the performance starts to go bad the number of "Idle Wake Ups" appears to spike from a few hundred to a few thousand. What does that number mean, and can I use it to solve this problem?
Edit 2 from the future: I never ended up figuring out this problem. Instead of doing it all in one application, I just chunked the list of files and ran each chunk with a separate Python invocation in a separate terminal window. Ugly but effective! The cause of the problem will forever be a mystery, but I assume it was some kind of problem deep in the networking stack of my work machine at the time.
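In case it helps anyone, the workaround amounted to something like this hedged sketch (download_chunk.py is a hypothetical script that downloads the paths passed to it, and the chunk size is a guess; I actually launched the chunks by hand in separate terminals):

import subprocess

CHUNK_SIZE = 500  # keep each invocation well under the ~900-file failure point
chunks = [s3_paths[i:i + CHUNK_SIZE] for i in range(0, len(s3_paths), CHUNK_SIZE)]
procs = [subprocess.Popen(["python", "download_chunk.py", *chunk]) for chunk in chunks]
for p in procs:
    p.wait()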

This isn't a surprise.
You don't get any parallelism when you have more threads than cores.
You can prove this to yourself by simplifying the problem to a single core with multiple threads.
What happens? Only one thread can run at a time, so the operating system context-switches between them to give everyone a turn: one thread works while the others sleep until they are woken in turn to do their bit. In that case you can't do better than a single thread.
You may do worse, because context switching and the memory allocated for each thread (about 1 MB each) have a price, too.
Read up on Amdahl's Law.
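For reference, Amdahl's Law bounds the speedup from N workers at 1 / ((1 - p) + p/N), where p is the fraction of the work that can actually run in parallel; even with p = 0.95, no number of threads gets you past a 20x speedup.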

Related

Handling RabbitMQ heartbeats when cpu is loaded 100% for a long time

I'm using pika 1.1 and graph-tool 3.4 in my Python application. It consumes tasks from RabbitMQ, which are then used to build graphs with graph-tool, and then it runs some calculations.
Some of the calculations, such as betweenness, take a lot of CPU power, which makes CPU usage hit 100% for a long time. Sometimes the RabbitMQ connection drops, which causes the task to start over from the beginning.
Even though the calculations run in a separate process, my guess is that while the CPU is loaded at 100%, pika can't find any opportunity to send a heartbeat to RabbitMQ, which causes the connection to terminate. This doesn't happen all the time, which suggests that by chance heartbeats do get sent from time to time. This is only my guess; I am not sure what else could cause it.
I tried lowering the priority of the calculation process using nice(19), which didn't work. I'm assuming it doesn't affect the processes spawned by graph-tool, which parallelizes work on its own.
Since it's just one line of code, graph.calculate_betweenness(..., I don't have a place to manually send heartbeats or slow the execution down to make room for them.
Can my guess that heartbeats aren't getting sent because the CPU is super busy be correct?
If yes, how can I handle this scenario?
Answering your questions:
Yes, that's basically it.
The solution we use is to run the CPU-intensive task in a separate process.
import time
from multiprocessing import Process

import pika

connection = pika.BlockingConnection(
    pika.ConnectionParameters(host='localhost'))
channel = connection.channel()

channel.exchange_declare(exchange='logs', exchange_type='fanout')

result = channel.queue_declare(queue='', exclusive=True)
queue_name = result.method.queue

channel.queue_bind(exchange='logs', queue=queue_name)

def cpu_intensive_task(ch, method, properties, body):
    def work(body):
        time.sleep(60)  # If I remember well, the default heartbeat is 30 seconds
        print(" [x] %r" % body)
    p = Process(target=work, args=(body,))
    p.start()
    # Important to notice: if you do p.join() you will have the same problem.

channel.basic_consume(
    queue=queue_name, on_message_callback=cpu_intensive_task, auto_ack=True)

channel.start_consuming()
I wonder if this is the best solution to this problem, or if RabbitMQ is even the best tool for CPU-intensive tasks. (For really long CPU-intensive tasks, more than 30 minutes, if you send manual ACKs you will also need to deal with the acknowledgement timeout: https://www.rabbitmq.com/consumers.html#acknowledgement-timeout)
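If you do go the manual-ACK route, a hedged sketch of combining it with the worker process above: a helper thread waits on the process and hands the ack back to the connection's own thread via pika's add_callback_threadsafe (channels are not thread-safe). Here heavy_cpu_task is a hypothetical stand-in for the real work, and queue_name/connection are the objects declared above:

import functools
import threading
from multiprocessing import Process

def on_message(ch, method, properties, body):
    def wait_and_ack():
        p = Process(target=heavy_cpu_task, args=(body,))  # heavy_cpu_task is hypothetical
        p.start()
        p.join()  # joining here blocks only this helper thread, not pika's I/O loop
        # Schedule the ack on the connection's own thread.
        connection.add_callback_threadsafe(
            functools.partial(ch.basic_ack, delivery_tag=method.delivery_tag))
    threading.Thread(target=wait_and_ack, daemon=True).start()

channel.basic_consume(queue=queue_name, on_message_callback=on_message, auto_ack=False)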

Performance difference between multithread using queue and futures.ThreadPoolExecutor using list in python3?

I was trying various approaches with Python multi-threading to see which one fits my requirements. To give an overview, I have a bunch of items that I need to send to an API. Then, based on the response, some of the items will go to a database and all of the items will be logged; e.g., if the API returns success for an item, that item is only logged, but when it returns failure, the item is sent to the database for a future retry, in addition to being logged.
Now, based on the API response, I can separate the success items from the failures and make one batch query with all the failure items, which will improve my database performance. To do that, I am accumulating all the requests in one place and trying to perform the API calls with multiple threads (since this is an I/O-bound task, I'm not even thinking about multiprocessing), but at the same time I need to keep track of which response belongs to which request.
Coming to the actual question, I tried two different approaches which I thought would give nearly identical performance, but there turned out to be a huge difference.
To simulate the API call, I created an API on my localhost with a 500 ms sleep (for average processing time). Please note that I want to start logging and inserting into the database only after all API calls are complete.
Approach - 1 (with threading.Thread and queue.Queue)
import requests
import datetime
import threading
import queue

def target(data_q):
    while not data_q.empty():
        data_q.get()
        response = requests.get("https://postman-echo.com/get?foo1=bar1&foo2=bar2")
        print(response.status_code)
        data_q.task_done()

if __name__ == "__main__":
    data_q = queue.Queue()
    for i in range(0, 20):
        data_q.put(i)
    start = datetime.datetime.now()
    num_thread = 5
    for _ in range(num_thread):
        worker = threading.Thread(target=target(data_q))
        worker.start()
    data_q.join()
    print('Time taken multi-threading: ' + str(datetime.datetime.now() - start))
I tried with 5, 10, 20, and 30 items, and the results are below, correspondingly:
Time taken multi-threading: 0:00:06.625710
Time taken multi-threading: 0:00:13.326969
Time taken multi-threading: 0:00:26.435534
Time taken multi-threading: 0:00:40.737406
What shocked me here is that I tried the same thing without multi-threading and got almost the same performance.
Then, after some googling around, I was introduced to the concurrent.futures module.
Approach - 2 (using concurrent.futures)

import datetime
import traceback

import requests
from concurrent import futures

def fetch_url(im_url):
    try:
        response = requests.get(im_url)
        return response.status_code
    except Exception:
        traceback.print_exc()

if __name__ == "__main__":
    data = []
    for i in range(0, 20):
        data.append(i)
    start = datetime.datetime.now()
    urls = ["https://postman-echo.com/get?foo1=bar1&foo2=bar2" + str(item) for item in data]
    with futures.ThreadPoolExecutor(max_workers=5) as executor:
        responses = executor.map(fetch_url, urls)
    for ret in responses:
        print(ret)
    print('Time taken future concurrent: ' + str(datetime.datetime.now() - start))
Again with 5, 10, 20, and 30 items, the results are below, correspondingly:
Time taken future concurrent: 0:00:01.276891
Time taken future concurrent: 0:00:02.635949
Time taken future concurrent: 0:00:05.073299
Time taken future concurrent: 0:00:07.296873
Now, I've heard about asyncio, but I haven't used it yet. I've also read that it gives even better performance than futures.ThreadPoolExecutor().
Final question: if both approaches are using threads (or so I think), then why is there such a huge performance gap? Am I doing something terribly wrong? I looked around but was not able to find a satisfying answer. Any thoughts on this would be highly appreciated. Thanks for going through the question.
[Edit 1] The whole thing is running on Python 3.8.
[Edit 2] Updated code examples and execution times. Now they should run on anyone's system.
The documentation of ThreadPoolExecutor explains in detail how many threads are started when the max_workers parameter is not given. The default depends on the exact Python version, but it is most probably more than 5, the number of threads in the first version using a queue. Passing futures.ThreadPoolExecutor(max_workers=5) explicitly, as your updated Approach 2 does, is what makes the two approaches comparable.
For the updated Approach 1, I suggest modifying the for loop a bit:
for _ in range(num_thread):
    target_to_run = target(data_q)
    print('target to run: {}'.format(target_to_run))
    worker = threading.Thread(target=target_to_run)
    worker.start()
The output will look like this:
200
...
200
200
target to run: None
target to run: None
target to run: None
target to run: None
target to run: None
Time taken multi-threading: 0:00:10.846368
The problem is that the Thread constructor expects a callable object (or None) as its target. You are not giving it a callable; instead, all of the queue processing happens during the call target(data_q), which the main thread executes while constructing the first Thread. The 5 threads that are then started do nothing, because their target is the call's return value, None.
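A minimal sketch of the fix: pass the function object itself and supply the queue via args, so each worker thread performs the call:

for _ in range(num_thread):
    # target is the function, not its return value; the Thread invokes target(data_q)
    worker = threading.Thread(target=target, args=(data_q,))
    worker.start()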

What is the best concurrency way of doing 10 000 continuous opencv operations simultaneously in Python3?

I used to have a relatively simple Python3 app that read a video streaming source and continuously performed OpenCV and I/O-heavy (file and database) operations:
cap_vid = cv2.VideoCapture(stream_url)
while True:
    # ...
    # OpenCV operations
    # database I/O operations
    # file I/O operations
    # ...
The app ran smoothly. However, there arose a need to do this not just with 1 channel, but with many, potentially 10 000 channels. Now, let's say I have a list of these channels (stream_urls). If I wrap my usual code inside for stream_url in stream_urls:, it will of course not work, because the iteration will never proceed further than the 0th index. So, the first thing that comes to mind is concurrency.
Now, as much as I know, there are 3 ways of doing concurrent programming in Python3: threading, asyncio, and multiprocessing:
I understand that in the case of multiprocessing the OS creates new instances of the Python interpreter, so there can usefully be at most as many instances as the machine has cores, which seldom exceeds 16; the number of processes needed, however, is potentially up to 10,000. Also, the overhead of multiprocessing outweighs the performance gains once the number of processes passes a certain point, so this one appears to be useless.
The case of threading seems the easiest in terms of the machinery it uses. I'd just wrap my code in a function and create threads like the following:
from threading import Thread

def work_channel(ch_src):
    cap_vid = cv2.VideoCapture(ch_src)
    while True:
        # ...
        # OpenCV operations
        # database I/O operations
        # file I/O operations
        # ...

for stream_url in stream_urls:
    Thread(target=work_channel, args=(stream_url,), daemon=True).start()
But there are a few problems with threading: first, using more than 11-17 threads nullifies any of its favourable effects because of the overhead costs. Also, it's not safe to work with file I/O in threading, which is a very important concern for me.
I don't know how to use asyncio and couldn't find how to do what I want with it.
Given the above scenario, which one of the 3 (or more if there are other methods that I am unaware of) concurrency methods should I use for the fastest, most accurate (or at least expected) performance? And what way should I use that method correctly?
Any help is appreciated.
Spawning more threads than your computer has cores is very much possible. It could be as simple as:
import threading

all_urls = [your url list]

def threadfunction(url):
    cap_vid = cv2.VideoCapture(url)
    while True:
        # ...
        # OpenCV operations
        # file I/O operations
        # ...

for stream_url in all_urls:
    threading.Thread(target=threadfunction, args=(stream_url,)).start()
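One hedged addition to the sketch above: keep references to the threads so the main thread can wait on them (or monitor them) instead of simply falling off the end of the script:

threads = []
for stream_url in all_urls:
    t = threading.Thread(target=threadfunction, args=(stream_url,))
    t.start()
    threads.append(t)

for t in threads:
    t.join()  # blocks until each channel worker exits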

Python multiprocessing taking the brakes off OSX

I have a program that randomly selects 13 cards from a full pack and analyses the hands for shape, point count, and some other features important to the game of bridge. The program will select and analyse 10**7 hands in about 5 minutes. Checking the Activity Monitor shows that during execution the CPU (a 6-core processor) devotes about 9% of its time to the program and is ~90% idle. So it looked like a prime candidate for multiprocessing, and I created a multiprocessing version using a Queue to pass information from each process back to the main program. Having navigated the problems of IDLE not working with multiprocessing (I now run it using PyCharm) and of the program freezing when a process is joined before it has finished, I got it to work.
However, it doesn't matter how many processes I use: 5, 10, 25 or 50, the result is always the same. The CPU devotes about 18% of its time to the program, is ~75% idle, and the execution time slightly more than doubles, to a bit over 10 minutes.
Can anyone explain how I can get the processes to take up more of the CPU time, and how I can get the execution time to reflect this? Below are the relevant sections of the program:
import random
import collections
import datetime
import time
from math import log10
from multiprocessing import Process, Queue

NUM_OF_HANDS = 10**6
NUM_OF_PROCESSES = 25

def analyse_hands(numofhands, q):
    # code removed as not relevant to the problem
    q.put((distribution, points, notrumps))

if __name__ == '__main__':
    processlist = []
    q = Queue()
    handsperprocess = NUM_OF_HANDS // NUM_OF_PROCESSES
    print(handsperprocess)

    # Set up the processes and get them to do their stuff
    start_time = time.time()
    for _ in range(NUM_OF_PROCESSES):
        p = Process(target=analyse_hands, args=(handsperprocess, q))
        processlist.append(p)
        p.start()

    # Allow q to get a few items
    time.sleep(.05)
    while not q.empty():
        while not q.empty():
            # code removed as not relevant to the problem
        # Allow q to be refreshed, so allowing all processes to finish before
        # doing a join. It seems that doing a join before a process is
        # finished will cause the program to lock.
        time.sleep(.05)
        counter['empty'] += 1

    for p in processlist:
        p.join()

    while not q.empty():
        # This is never executed, as all the processes have finished and q
        # was emptied before the join command above.
        # code removed as not relevant to the problem

    finish_time = time.time()
I have no answer to why IDLE will not run a multiprocessing start instruction correctly, but I believe the answer to the doubling of the execution time lies in the type of problem I am dealing with. Perhaps others can comment, but it seems to me that the overhead involved in adding and removing items to and from the Queue is quite high, so that the best performance improvements come when the amount of data passed via the Queue is small compared with the amount of processing required to obtain that data.
In my program I am creating and passing 10**7 items of data, and I suppose it is the overhead of passing that many items via the Queue that kills any performance improvement from getting the data via separate Processes. Using a map, it seems, would require all 10**7 items of data to be stored in the map before any further processing could be done. That might improve performance, depending on the overhead of the map and of handling that amount of data, but for the time being I will stick with my original vanilla, single-process code.
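To illustrate that point (a hedged sketch, not the original program): aggregating inside each process and putting one combined result on the Queue means the transfer cost is paid NUM_OF_PROCESSES times rather than once per hand. This reuses the imports and names from the code above:

def analyse_hands(numofhands, q):
    totals = collections.Counter()
    for _ in range(numofhands):
        # ... analyse one hand and update totals (analysis elided, as above) ...
        pass
    q.put(totals)  # a single put per process instead of one per item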

Improve Python log tailing when a lot of data is written in one second

I wrote a method to tail a log file, e.g.:
import os

def getTailLog(self):
    with open(self.strFileName, 'rb') as fileObj:
        fileObj.seek(0, os.SEEK_END)
        try:
            while True:
                if self.booleanGetTailExit:
                    break
                strLineContent = fileObj.readline()
                if not strLineContent:
                    continue
                else:
                    yield strLineContent.decode('utf-8').strip('\n')
        except KeyboardInterrupt:
            pass
This method tails the log, but it lags, and can even get stuck, when a massive amount of data is written to the log file within one second.
How can I fix this?
Thanks a lot.
To be honest, I do not fully understand what you mean by the tail lagging or getting stuck when massive amounts of data are written to the log file within one second.
Your code contains a while loop which can potentially run forever. It looks like your code waits for lines to be appended to the end of the file self.strFileName. The problem is that it does not just wait: it continuously re-checks the file, spinning at full speed. This is a CPU-bound busy loop, which may cause huge delays in reading/writing within the same process (up to 10 seconds for a 100 KB binary file, in my experience); Python is particularly prone to this because of the GIL (global interpreter lock).
To solve your problem, you should replace the busy-wait with another implementation: either schedule the checks (at minimum, pause between consecutive checks) or use an event-driven approach (if you know when new lines are added to the file).
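A minimal sketch of the scheduled variant, pausing briefly whenever no new data has arrived (the 0.1-second interval is a guess; tune it to your latency needs):

import os
import time

def tail(path):
    with open(path, 'rb') as fileObj:
        fileObj.seek(0, os.SEEK_END)
        while True:
            strLineContent = fileObj.readline()
            if not strLineContent:
                time.sleep(0.1)  # yield the CPU instead of spinning
                continue
            yield strLineContent.decode('utf-8').strip('\n')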
