Python Fire dynamic urls using multithreading

I'm new to Python threading, and I've gone through multiple posts, but I really did not understand how to use it. However, I tried to complete my task, and I want to check if my approach is right.
The task is:
Read a big CSV containing around 20K records, fetch the id from each record, and fire an HTTP API call for each record of the CSV.
import csv
import threading
import time

import requests

t1 = time.time()
file_data_obj = csv.DictReader(open(file_path, 'rU'))
threads = []
for record in file_data_obj:
    apiurl = "https://www.api-server.com?id=" + record.get("acc_id", "")
    thread = threading.Thread(target=requests.get, args=(apiurl,))
    thread.start()
    threads.append(thread)
t2 = time.time()
for thread in threads:
    thread.join()
print("Total time required to process a file - {} Secs".format(t2 - t1))
As there are 20K records, would it start 20K threads? Or will the OS/Python handle it? If yes, can we restrict it?
How can I collect the response returned by requests.get?
Would t2 - t1 really give me the time required to process the whole file?

As there are 20K records, would it start 20K threads? Or will the OS/Python handle it? If yes, can we restrict it?
Yes - it will start a thread for each iteration. The maximum number of threads is dependent on your OS.
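If you want to restrict that with the threading module alone, one option (a sketch, not from the original answer, reusing the file_data_obj and URL pattern from the question) is a fixed set of worker threads consuming from a queue:
from queue import Queue
from threading import Thread

import requests

NUM_WORKERS = 20  # assumed cap on the number of threads
url_queue = Queue()

def worker():
    # Each worker pulls URLs until it receives the None sentinel
    while True:
        apiurl = url_queue.get()
        if apiurl is None:
            url_queue.task_done()
            break
        requests.get(apiurl)
        url_queue.task_done()

workers = [Thread(target=worker) for _ in range(NUM_WORKERS)]
for w in workers:
    w.start()

for record in file_data_obj:
    url_queue.put("https://www.api-server.com?id=" + record.get("acc_id", ""))
for _ in range(NUM_WORKERS):
    url_queue.put(None)  # one sentinel per worker
url_queue.join()
With this layout only NUM_WORKERS threads ever exist, no matter how many records the CSV contains.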
How can I grab the response returned by requests.get?
If you want to use the threading module only, you'll have to make use of a Queue. Threads return None by design, hence you'll have to implement a line of communication between the threads and your main loop yourself.
from queue import Queue
from threading import Thread

import requests

q = Queue()
threads = []

# A thread target that performs the request and puts the response on the queue
def return_get(q, apiurl):
    q.put(requests.get(apiurl))

for record in file_data_obj:
    apiurl = "https://www.api-server.com?id=" + record.get("acc_id", "")
    t = Thread(target=return_get, args=(q, apiurl))
    t.start()
    threads.append(t)

for thread in threads:
    thread.join()

while not q.empty():
    r = q.get()  # Fetches the next response from the queue
    print(r.text)
An alternative is to use a worker pool.
from concurrent.futures import ThreadPoolExecutor

import requests

threads = []
pool = ThreadPoolExecutor(10)

# Submit work to the pool
for record in file_data_obj:
    apiurl = "https://www.api-server.com?id=" + record.get("acc_id", "")
    t = pool.submit(requests.get, apiurl)
    threads.append(t)

# Each submit() returns a Future; result() blocks until that request is done
for t in threads:
    print(t.result())

You can use ThreadPoolExecutor
import concurrent.futures
import urllib.request

# Retrieve a single page and report the URL and contents
def load_url(url, timeout):
    with urllib.request.urlopen(url, timeout=timeout) as conn:
        return conn.read()

# Create a pool executor with N workers
with concurrent.futures.ThreadPoolExecutor(max_workers=N_workers) as executor:
    # Start the load operations and mark each future with its URL
    future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            data = future.result()
        except Exception as exc:
            print('%r generated an exception: %s' % (url, exc))
        else:
            print('%r page is %d bytes' % (url, len(data)))
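Adapted to the CSV + requests setup from the question (a sketch under the same assumptions about file_data_obj and the id-based URL; not part of the original answer), the same pattern looks like this:
import concurrent.futures

import requests

with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    # Map each future back to the account id it was submitted for
    future_to_id = {}
    for record in file_data_obj:
        acc_id = record.get("acc_id", "")
        apiurl = "https://www.api-server.com?id=" + acc_id
        future_to_id[executor.submit(requests.get, apiurl)] = acc_id

    for future in concurrent.futures.as_completed(future_to_id):
        acc_id = future_to_id[future]
        try:
            response = future.result()
        except Exception as exc:
            print("{} generated an exception: {}".format(acc_id, exc))
        else:
            print("{} -> HTTP {}".format(acc_id, response.status_code))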

Related

queue.get(block=True) while start_background_task (flask-socketio) is running and doing queue.put()

I have an issue related to a queue, using a background task that never ends (it continuously runs to grab real-time data).
What I want to achieve:
Starting server via flask-socketio (eventlet),
monkey_patch(),
Using start_background_task, run a function from another file that grabs data in real time,
While this background task is running (indefinitely), storing incoming data in a queue via queue.put(),
Still while this task is running, watching from the main program for new data in the queue and processing it, meaning here socketio.emit().
What works: my program works well if, in the background task file, the while loop ends (while count < 100: for instance). In this case, I can access the queue from the main file and emit data.
What doesn't work: if this while loop is now while True:, the program blocks somewhere; I can't access the queue from the main program, as it seems to wait until the background task returns or stops.
So I guess I'm missing something here... if you can help me with that, or give me some clues, that would be awesome.
Here some relevant parts of the code:
main.py
from threading import Thread
from threading import Lock
from queue import Queue
from get_raw_program import get_raw_data
from flask import Flask, send_from_directory, Response, jsonify, request, abort
from flask_socketio import SocketIO
import eventlet

eventlet.patcher.monkey_patch(select=True, socket=True)

app = Flask(__name__, static_folder=static_folder, static_url_path='')
app.config['SECRET_KEY'] = 'secret_key'
socketio = SocketIO(app, binary=True, async_mode="eventlet", logger=True, engineio_logger=True)
thread = None
thread_lock = Lock()
data_queue = Queue()

[...]

@socketio.on('WebSocket_On')
def grab_raw_data(test):
    global thread
    with thread_lock:
        if thread is None:
            socketio.emit('raw_data', {'msg': 'Thread is None:'})
            socketio.emit('raw_data', {'msg': 'Starting Thread... '})
            thread = socketio.start_background_task(target=get_raw_data(data_queue, test['mode']))
            while True:
                if not data_queue.empty():
                    data = data_queue.get(block=True, timeout=0.05)
                    socketio.emit('raw_data', {'msg': data})
                socketio.sleep(0.0001)
get_raw_program.py (which works, can access queue from main.py)
def get_raw_data(data_queue, test):
    count = 0
    while count < 100:
        data_queue.put(b'\xe5\xce\x04\x00\xfe\xd2\x04\x00')
        time.sleep(0.001)
        count += 1
get_raw_program.py (which DOESN'T work, can't access queue from main.py)
def get_raw_data(data_queue, test):
    count = 0
    while True:
        data_queue.put(b'\xe5\xce\x04\x00\xfe\xd2\x04\x00')
        time.sleep(0.001)
        count += 1
I tried with regular Thread instead of start_background_task, and it works well. Thanks again for your help, greatly appreciated :-)
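For reference, flask-socketio's start_background_task expects the callable and its arguments separately, whereas the snippet above evaluates get_raw_data(...) at the point of the call. A minimal sketch of the documented calling convention (not a confirmed fix for this question):
# Pass the function object plus its arguments; start_background_task
# invokes it in the background:
thread = socketio.start_background_task(get_raw_data, data_queue, test['mode'])

# By contrast, target=get_raw_data(data_queue, test['mode']) runs
# get_raw_data immediately in the current handler before any task is created.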

python concurrent.futures skip timeout processes

I am dealing with thousands of image urls and want to use concurrent.futures.ProcessPoolExecutor to speed things up.
Since some of the urls are broken or the images are large, the process function may hang or unexpectedly consume a lot of time during processing. I want to add a timeout on the process function, like 10 seconds, to get rid of these invalid images.
I tried to set the timeout param in futures.as_completed, and the TimeoutError could be successfully raised. However, it seems that the main process will still wait until the timed-out child process is completed. Is there any approach to immediately kill the timed-out child process and put the next url into the pool?
from concurrent import futures

def process(url):
    ### Some time consuming operation
    return result

def main():
    urls = ['url1', 'url2', 'url3', ..., 'url100']
    with futures.ProcessPoolExecutor(max_workers=10) as executor:
        future_list = {executor.submit(process, url): url for url in urls}
        results = []
        try:
            for future in futures.as_completed(future_list, timeout=10):
                results.append(future.result())
        except futures.TimeoutError:
            print("timeout")
        print(results)

if __name__ == '__main__':
    main()
In the above example, suppose that I have 100 urls and 10 of them are invalid and may cost a lot of time; how do I get the processed result list for the remaining 90 urls?
Not with the concurrent.futures library.
The pebble module has been developed to overcome this limitation.
from pebble import ProcessPool
from concurrent.futures import TimeoutError

with ProcessPool() as pool:
    future = pool.schedule(function, args=(1, 2), timeout=5)
    try:
        result = future.result()  # blocks until results are ready
    except TimeoutError as error:
        print("Function took longer than %d seconds" % error.args[1])

Memory efficient massive http requests

I need to make an unlimited number of HTTP requests to a web API, one after another, and make it work efficiently and quite fast. (I need it for a utility, so it should work no matter how many times I'm using it; it should also be usable on a web server, with people using it at the same time.)
Right now I'm using threading with a queue, but after a while I'm getting errors like:
'can't start a new thread'
'MemoryError'
or it may work for a bit, but pretty slowly.
This is a part of my code:
concurrent = 25
q = Queue(concurrent * 2)
for i in range(concurrent):
    t = Thread(target=receiveJson)
    t.daemon = True
    t.start()
for url in get_urls():
    q.put(url.strip())
q.join()
*get_urls() is a simple function that returns a list of urls (unknown length)
This is my receiveJson (thread target):
def receiveJson():
    while True:
        url = q.get()
        res = requests.get(url).json()
        q.task_done()
The problem is coming from your threads never ending; notice that there is no exit condition in your receiveJson function. The simplest way to signal that they should end is usually by enqueuing None:
def receiveJson():
    while True:
        url = q.get()
        if url is None:  # Exit condition allows thread to complete
            q.task_done()
            break
        res = requests.get(url).json()
        q.task_done()
and then you can change the other code as follows:
concurrent = 25
q = Queue(concurrent * 2)
for i in range(concurrent):
    t = Thread(target=receiveJson)
    t.daemon = True
    t.start()
for url in get_urls():
    q.put(url.strip())
for i in range(concurrent):
    q.put(None)  # Add a None for each thread to be able to get and complete
q.join()
There are other ways of doing this, but this is how to do it with the least amount of change to your code. If this is happening often, it might be worth looking into the concurrent.futures.ThreadPoolExecutor class to avoid the cost of opening threads very often.
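A rough sketch of what that could look like (assuming the same get_urls() helper and a JSON-returning API; not part of the original answer):
from concurrent.futures import ThreadPoolExecutor, as_completed

import requests

def receive_json(url):
    # Each task fetches one url and returns the decoded JSON
    return requests.get(url).json()

results = []
with ThreadPoolExecutor(max_workers=25) as executor:
    futures = [executor.submit(receive_json, url.strip()) for url in get_urls()]
    for future in as_completed(futures):
        try:
            results.append(future.result())
        except Exception as exc:
            print("request failed:", exc)
The executor reuses its 25 threads for every submitted url, so no new thread is created per request.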

Convert thread function to asyncio

I need to get some prices from an external API, one by one, for dozens of objects, and it's like 2 or 3 seconds for each request, so it can become pretty long.
I (kind of) knew how to do multithreading in Python; I implemented it, and it works fine and is pretty fast.
Then I recently discovered asyncio, and it seems it could be useful in my situation instead of opening several threads.
So I tried to "convert" my multithreaded code to code using asyncio, as you can see below, after reading some examples.
But testOne doesn't work and the error is Task exception was never retrieved.
I cleaned the code for better understanding (let me know if you need more information).
from threading import Thread
import asyncio

### ASYNC MULTI THREAD ####
def prixMulti(client, symbol, prix):
    prix[symbol] = # API price request using client

def testMulti(client, sql):
    prix = {}
    objects = # Database request using sql
    listeThread = []
    for object in objects:
        listeThread.append(Thread(target=prixMulti, args=(client, object['name'], prix)))
    for t in listeThread:
        t.start()
    for t in listeThread:
        t.join()
    print(prix)

#### ASYNC ONE THREAD ####
async def prixOne(client, symbol):
    return  # same API price request using client

async def prixOneWait(client, symbol, prix):
    prix[symbol] = await prixOne(client, symbol)

def testOne(client, sql):
    prix = {}
    objects = # Database request using sql
    tasks = []
    loop = asyncio.get_event_loop()
    for object in objects:
        tasks.append(loop.create_task(prixOneWait(client, prix, object['nom'])))
    loop.run_until_complete(asyncio.wait(tasks))
    loop.close()
    print(prix)

# Some code to initialise client and sql
testMulti(client, sql)
testOne(client, sql)
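For comparison only (not an answer from this thread): when the underlying price request is a blocking call, a common asyncio pattern is to push it onto the default executor and gather the results. A minimal sketch, assuming a hypothetical blocking client.get_price(symbol):
import asyncio

async def prix_one(loop, client, symbol):
    # Run the blocking price request in the default thread pool executor
    return await loop.run_in_executor(None, client.get_price, symbol)  # client.get_price is hypothetical

def test_async(client, symbols):
    loop = asyncio.get_event_loop()
    tasks = [prix_one(loop, client, s) for s in symbols]
    # gather() keeps the results in input order; exceptions are returned rather than raised
    results = loop.run_until_complete(asyncio.gather(*tasks, return_exceptions=True))
    return dict(zip(symbols, results))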

Thread objects not freed from memory

I wrote a continuous script that collects some data from the internet every few seconds, keeps it in memory for a while, periodically stores it all to db and then deletes it. To keep everything running smoothly I use threads to collect the data from several sources at the same time. To minimize db operations and to avoid conflict with other db processes, I only write every now and then.
The memory from the deleted variables is never released, and eventually the usage grows so large that the script crashes (shown by tracemalloc and pympler). I guess I'm handling the data coming out of the threads wrong, but I don't know how I could do it differently. A minimal example is below.
Addition: I don't think I can use a queue because in reality multiple functions are threaded from this point, modifying different local variables.
import threading
import time
import tracemalloc
import pympler.muppy, pympler.summary
import gc

tracemalloc.start()

def a():
    # collect data
    collection.update({int(time.time()): list(range(1, 1000))})
    return

collection = {}
threads = []
start = time.time()
cycle = 0

while time.time() < start + 60:
    cycle += 1
    t = threading.Thread(target=a)
    threads.append(t)
    t.start()
    time.sleep(1)

    for t in threads:
        if t.is_alive() == False:
            t.join()

    # periodically delete data
    delete = []
    for key, val in collection.items():
        if key < time.time() - 10:
            delete.append(key)
    for delet in delete:
        print('DELETING:', delet)
        del collection[delet]

    gc.collect()
    print('CYCLE:', cycle, 'THREADS:', threading.active_count(), 'COLLECTION:', len(collection))
    print(tracemalloc.get_traced_memory())

all_objects = pympler.muppy.get_objects()
sum1 = pympler.summary.summarize(all_objects)
pympler.summary.print_(sum1)
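One pattern sometimes used to keep the threads list from growing without bound (a sketch, not a diagnosis of the leak reported here) is to drop finished threads from the list once they have been joined:
# After joining finished threads, prune them from the list so the
# Thread objects themselves can be garbage collected
threads = [t for t in threads if t.is_alive()]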
