I am a bit confused on how best to do the following. I am not after the code, but rather what I should do and when.
I want to do the following
get a list of shops (model query)
create a pool for each shop (pool1, pool2, pool3)
for each shop get all the products
process each product by adding it to the pool, so pool1-product1,pool1-product2
The above shows that I have lots of shop and each shop has lots of products that need to be processed. I want the shops to be processed at the same time
I am confused on what is the best way to approach this
Any help would be appreciated
Thanks
Grant
You are trying to parallelize I/O-bound tasks (tasks consisting mostly of network calls and other I/Os). You can either use a cooperative multitasking library such as AsyncIO of use a thread pool for that. Here is what it could look like using a thread pool:
from concurrent.futures import ThreadPoolExecutor
def process_product(product):
...
def process_shop(shop):
for product in shop.products:
process_product(product)
def main():
shops = [...]
with ThreadPoolExecutor() as executor:
executor.map(process_shop, shops)
Usually, you do not want to create a new thread for each shop as it could be expensive and inefficient, hence the use of a pool.
With AsyncIO, you can use the gather() function to launch concurrent tasks. It would ressembles:
import asyncio
async def process_product(product):
...
async def process_shop(shop):
for product in shop.products:
await process_product(product)
async def main():
shops = [...]
await asyncio.gather(*[process_shop(shop) for shop in shops])
if __name__ == "__main__":
asyncio.run(main())
Related
Concretely, I'm using Flask to process a request, pseudocode like this:
from flask import Flask, request
app = Flask(__name__)
#app.route("/foo", methods=["POST"])
def foo():
data = request.get_json() # {"request_id": "abc", "data": "some text"}
result_a = do_task_a(data) # returns {"result_a": "a"}, maybe about 1 second to finish
result_b = do_task_b(data) # returns {"result_b": "b"}, maybe about 1 second to finish
result_c = do_task_c(data) # returns {"result_c": "c"}, maybe about 1 second to finish
result = {
"result_a": result_a["result_a"],
"result_b": result_b["result_b"],
"result_c": result_c["result_c"]}
return result
app.run(host='0.0.0.0', port=4000, threaded=False)
Here, do_task_a, do_task_b, do_task_c are completely independent subtasks, I know I can use multiprocessing.Process to create processes to finish these three subtasks, and use join() to wait for subtask done, But I don't know it's proper way to create Process for every request?
Maybe I can use multiprocessing.Queue to help, But I don't find a good way.
I search for multiprocessing, but can't figure out a good solution.
I'm not a python guy, but indeed creating processes is sn expensive operation
If its possible - create threads they're cheaper than processes.
If you run the request multiple times - you can do even better than that, because creating threads per request is still quite expensive
Even more advanced setup is to create a "pre-loaded" thread pool. Like N threads that you always keep in memory ready for running arriving task.
In terms of technical solution I've found This article that explains how to create thread pools in python 3.2+
I have 3 classes that represent nearly isolated processes that can be run concurrently (meant to be persistent, like 3 main() loops).
class DataProcess:
...
def runOnce(self):
...
class ComputeProcess:
...
def runOnce(self):
...
class OtherProcess:
...
def runOnce(self):
...
Here's the pattern I'm trying to achieve:
start various streams
start each process
allow each process to publish to any stream
allow each process to listen to any stream (at various points in it's loop) and behave accordingly (allow for interruption of it's current task or not, etc.)
For example one 'process' Listens for external data. Another process does computation on some of that data. The computation process might be busy for a while, so by the time it comes back to start and checks the stream, there may be many values that piled up. I don't want to just use a queue because, actually I don't want to be forced to process each one in order, I'd rather be able to implement logic like, "if there is one or multiple things waiting, just run your process one more time, otherwise go do this interruptible task while you wait for something to show up."
That's like a lot, right? So I was thinking of using an actor model until I discovered RxPy. I saw that a stream is like a subject
from reactivex.subject import BehaviorSubject
newData = BehaviorSubject()
newModel = BehaviorSubject()
then I thought I'd start 3 threads for each of my high level processes:
thread = threading.Thread(target=data)
threads = {'data': thread}
thread = threading.Thread(target=compute)
threads = {'compute': thread}
thread = threading.Thread(target=other)
threads = {'other': thread}
for thread in threads.values():
thread.start()
and I thought the functions of those threads should listen to the streams:
def data():
while True:
DataProcess().runOnce() # publishes to stream inside process
def compute():
def run():
ComuteProcess().runOnce()
newData.events.subscribe(run())
newModel.events.subscribe(run())
def other():
''' not done '''
ComuteProcess().runOnce()
Ok, so that's what I have so far. Is this pattern going to give me what I'm looking for?
Should I use threading in conjunction with rxpy or just use rxpy scheduler stuff to achieve concurrency? If so how?
I hope this question isn't too vague, I suppose I'm looking for the simplest framework where I can have a small number of computational-memory units (like objects because they have internal state) that communicate with each other and work in parallel (or concurrently). At the highest level I want to be able to treat these computational-memory units (which I've called processes above) as like individuals who mostly work on their own stuff but occasionally broadcast or send a message to a specific other individual, requesting information or providing information.
Am I perhaps actually looking for an actor model framework? or is this RxPy setup versatile enough to achieve that without extreme complexity?
Thanks so much!
I have a simple rest service which allows you to create task. When a client requests a task - it returns a unique task number and starts executing in a separate thread. The easiest way to implement it
class Executor:
def __init__(self, max_workers=1):
self.executor = ThreadPoolExecutor(max_workers)
def execute(self, body, task_number):
# some logic
pass
def some_rest_method(request):
body = json.loads(request.body)
task_id = generate_task_id()
Executor(max_workers=1).execute(body)
return Response({'taskId': task_id})
Is it a good idea to create each time ThreadPoolExecutor with one (!) workers if i know than one request - is one new task (new thread). Perhaps it is worth putting them in the queue somehow? Maybe the best option is to create a regular stream every time?
Is it a good idea to create each time ThreadPoolExecutor...
No. That completely defeats the purpose of a thread pool. The reason for using a thread pool is so that you don't create and destroy a new thread for every request. Creating and destroying threads is expensive. The idea of a thread pool is that it keeps the "worker thread(s)" alive and re-uses it/them for each next request.
...with just one thread
There's a good use-case for a single-threaded executor, though it probably does not apply to your problem. The use-case is, you need a sequence of tasks to be performed "in the background," but you also need them to be performed sequentially. A single-thread executor will perform the tasks, one after another, in the same order that they were submitted.
Perhaps it is worth putting them in the queue somehow?
You already are putting them in a queue. Every thread pool has a queue of pending tasks. When you submit a task (i.e., executor.execute(...)) that puts the task into the queue.
what's the best way...in my case?
The bones of a simplistic server look something like this (pseudo-code):
POOL = ThreadPoolExecutor(...with however many threads seem appropriate...)
def service():
socket = create_a_socket_that_listens_on_whatever_port()
while True:
client_connection = socket.accept()
POOL.submit(request_handler, connection=connection)
def request_handler(connection):
request = receive_request_from(connection)
reply = generate_reply_based_on(request)
send_reply_to(reply, connection)
connection.close()
def main():
initialize_stuff()
service()
Of course, there are many details that I have left out. I can't design it for you. Especially not in Python. I've written servers like this in other languages, but I'm pretty new to Python.
I have an application that lets me select whether to use threads or processes:
def _get_future(self, workers):
if self.config == "threadpool":
self.logger.debug("using thread pools")
executor = ThreadPoolExecutor(max_workers=workers)
else:
self.logger.debug("using process pools")
executor = ProcessPoolExecutor(max_workers=workers)
return executor
Later I execute the code:
self.executor = self._get_future()
for component in components:
self.logger.debug("submitting {} to future ".format(component))
self.future_components.append(self.executor.submit
(self._send_component, component))
# Wait for all tasks to finish
while self.future_components:
self.future_components.pop().result()
When I use processes, my Applications gets stuck. The _send_component method is never called. When I use threads all works fine.
The problem is the imperative approach, this is a use case for a functional approach.
self._send_component is a member function of a class. Separate processes mean no joint memory to share variables.
The solution was to rewrite the code so that _send_component is a static method.
I am using "watchdog" api to keep checking changes in a folder in my filesystem. Whatever files changes in that folder, I pass them to a particular function which starts threads for each file I pass them.
But watchdog, or any other filesystem watcher api (in my knowledge), notifies users file by file i.e. as the files come by, they notify the user. But I would like it to notify me a whole bunch of files at a time so that I can pass that list to my function and take use of multi-threading. Currently, when I use "watchdog", it notifies me one file at a time and I am only able to pass that one file to my function. I want to pass it many files at a time to be able to have multithreading.
One solution that comes to my mind is: you see when you copy a bunch of files in a folder, OS shows you a progress bar. If it would be possible for me to be notified when that progress bar is done, then it would be a perfect solution for my question. But I don't know if that is possible.
Also I know that watchdog is a polling API, and an ideal API for watching filesystem would be interrupt driven api like pyinotify. But I didn't find any API which was interrupt driven and also cross platform. iWatch is good, but only for linux, and I want something for all OS. So, if you have suggestions on any other API, please do let me know.
Thanks.
Instead of accumulating filesystem events, you could spawn a pool of worker
threads which get tasks from a common queue. The watchdog thread could then put
tasks in the queue as filesystem events occur. Done this way, a worker thread
can start working as soon as an event occurs.
For example,
import logging
import Queue
import threading
import time
import watchdog.observers as observers
import watchdog.events as events
logger = logging.getLogger(__name__)
SENTINEL = None
class MyEventHandler(events.FileSystemEventHandler):
def on_any_event(self, event):
super(MyEventHandler, self).on_any_event(event)
queue.put(event)
def __init__(self, queue):
self.queue = queue
def process(queue):
while True:
event = queue.get()
logger.info(event)
if __name__ == '__main__':
logging.basicConfig(level=logging.DEBUG,
format='[%(asctime)s %(threadName)s] %(message)s',
datefmt='%H:%M:%S')
queue = Queue.Queue()
num_workers = 4
pool = [threading.Thread(target=process, args=(queue,)) for i in range(num_workers)]
for t in pool:
t.daemon = True
t.start()
event_handler = MyEventHandler(queue)
observer = observers.Observer()
observer.schedule(
event_handler,
path='/tmp/testdir',
recursive=True)
observer.start()
try:
while True:
time.sleep(1)
except KeyboardInterrupt:
observer.stop()
observer.join()
Running
% mkdir /tmp/testdir
% script.py
yields output like
[14:48:31 Thread-1] <FileCreatedEvent: src_path=/tmp/testdir/.#foo>
[14:48:32 Thread-2] <FileModifiedEvent: src_path=/tmp/testdir/foo>
[14:48:32 Thread-3] <FileModifiedEvent: src_path=/tmp/testdir/foo>
[14:48:32 Thread-4] <FileDeletedEvent: src_path=/tmp/testdir/.#foo>
[14:48:42 Thread-1] <FileDeletedEvent: src_path=/tmp/testdir/foo>
[14:48:47 Thread-2] <FileCreatedEvent: src_path=/tmp/testdir/.#bar>
[14:48:49 Thread-4] <FileCreatedEvent: src_path=/tmp/testdir/bar>
[14:48:49 Thread-4] <FileModifiedEvent: src_path=/tmp/testdir/bar>
[14:48:49 Thread-1] <FileDeletedEvent: src_path=/tmp/testdir/.#bar>
[14:48:54 Thread-2] <FileDeletedEvent: src_path=/tmp/testdir/bar>
Doug Hellman has written an excellent set of tutorials (which has now been edited into a book) which should help you get started:
on using Queue
the threading module
how to setup and use a pool of worker processes
how to setup a pool of worker threads
I didn't actually end up using a multiprocessing Pool or ThreadPool as discussed
in the last two links, but you may find them useful anyway.