I currently have multiple python-rq workers executing jobs from a queue in parallel. Each job also utilizes the python multiprocessing module.
Job execution code is simply this:
from redis import Redis
from rq import Queue
q = Queue('calculate', connection=Redis())
job = q.enqueue(calculateJob, someArgs)
And calculateJob is defined as such:
import multiprocessing as mp
from functools import partial

def calculateJob(someArgs):
    pool = mp.Pool()  # by default, one worker process per CPU core
    result = partial(someFunc, someArgs=someArgs)

def someFunc(someArgs):
    # do something
    return output
So presumably when a job is being processed, all cores are automatically being utilized by that job. How does another worker processing another job in parallel execute its job if the first job is utilizing all cores already?
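It depends on how your operating system schedules processes. A 6-core computer does not freeze when you play a video and start five more processes, because the scheduler shares the cores among everything that is runnable. Each worker is a new process (a fork of a process, really), so a second job still runs, but both jobs then compete for the same cores. Instead of using multiprocessing inside a job, you can put each unit of work on the queue as its own job and let rq provide the parallelism by running multiple workers.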
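If a job does keep its own pool, one way to avoid oversubscribing the machine is to cap the pool size so that workers times pool processes stays close to the core count. This is only a sketch: `NUM_WORKERS`, `pool_size` and `make_pool` are hypothetical names, and the worker count is something you would have to keep in sync yourself, since it is not read from rq here.

```python
import multiprocessing as mp
import os

NUM_WORKERS = 4  # hypothetical: how many rq workers you run on this machine


def pool_size(num_workers, cpu_count=None):
    """Split the available cores evenly among workers, at least 1 each."""
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1
    return max(1, cpu_count // num_workers)


def make_pool():
    # Each job's pool gets only its share of the cores instead of all of them.
    return mp.Pool(processes=pool_size(NUM_WORKERS))
```

With 8 cores and 4 workers, each job's pool would get 2 processes, so four jobs running at once roughly fill the machine.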
Following on from "run multiple instances of python script simultaneously", I can now write a Python program that runs multiple instances.
import sys
import subprocess
for i in range(1000):
    subprocess.Popen([sys.executable, 'task.py', '{}in.csv'.format(i), '{}out.csv'.format(i)])
This starts 1000 subprocesses simultaneously. If the command that each subprocess runs is computationally intensive, this puts a heavy load on the machine (and may even crash it).
Is there a way where I can restrict the number of subprocess to be run at a time? For example something like this:
if (#subprocess_currently_running = 10) {
wait(); // Or sleep
}
That is, allow only 10 subprocesses to run at a time, and start a new one whenever one of the ten finishes.
A counting semaphore is a good old mechanism for capping the number of concurrently running threads or processes.
But since each subprocess.Popen object (that is, each process) has to be waited on until it terminates, note the downside of subprocess.Popen.wait() that the official documentation points out, which matters when many subprocesses run concurrently:
Note: The function is implemented using a busy loop (non-blocking call and short sleeps). Use the asyncio module for an asynchronous wait: see asyncio.create_subprocess_exec.
Thus, it's preferable for us to switch to:
asyncio.create_subprocess_exec
asyncio.Semaphore
How it can be implemented:
import asyncio
import sys

MAX_PROCESSES = 10

async def process_csv(i, sem):
    async with sem:  # controls/allows running 10 concurrent subprocesses at a time
        proc = await asyncio.create_subprocess_exec(sys.executable, 'task.py',
                                                    f'{i}in.csv', f'{i}out.csv')
        await proc.wait()

async def main():
    sem = asyncio.Semaphore(MAX_PROCESSES)
    await asyncio.gather(*[process_csv(i, sem) for i in range(1000)])

asyncio.run(main())
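If you would rather not use asyncio, a thread pool gives the same cap: each worker thread blocks in `subprocess.run` until its child exits, so at most `max_workers` children run at once. This is a sketch of that alternative, not part of the answer above; `run_all` and `commands` are illustrative names.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

MAX_PROCESSES = 10


def run_all(commands, max_workers=MAX_PROCESSES):
    """Run each command as a subprocess, at most max_workers at a time.

    Returns the exit codes in the same order as commands.
    """
    def run_one(cmd):
        # Blocks this worker thread until the child process has exited.
        return subprocess.run(cmd).returncode

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_one, commands))


# Usage for the question's case (not executed here):
# commands = [[sys.executable, 'task.py', f'{i}in.csv', f'{i}out.csv']
#             for i in range(1000)]
# exit_codes = run_all(commands)
```

The threads do almost no work themselves, so the GIL is not a bottleneck here; they only wait for their child processes.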
In the example code below, all asyncio tasks are started first. Each task is then resumed once its I/O operation has finished.
The output looks like this; note that the 6 result messages only appear after all 6 start messages.
-- Starting https://jamanetwork.com/rss/site_3/67.xml...
-- Starting https://www.b-i-t-online.de/bitrss.xml...
-- Starting http://twitrss.me/twitter_user_to_rss/?user=cochranecollab...
-- Starting http://twitrss.me/twitter_user_to_rss/?user=cochranecollab...
-- Starting https://jamanetwork.com/rss/site_3/67.xml...
-- Starting https://www.b-i-t-online.de/bitrss.xml...
28337 size for http://twitrss.me/twitter_user_to_rss/?user=cochranecollab
28337 size for http://twitrss.me/twitter_user_to_rss/?user=cochranecollab
1938204 size for https://www.b-i-t-online.de/bitrss.xml
1938204 size for https://www.b-i-t-online.de/bitrss.xml
38697 size for https://jamanetwork.com/rss/site_3/67.xml
38697 size for https://jamanetwork.com/rss/site_3/67.xml
FINISHED with 6 results from 6 tasks.
But what I would expect, and what would speed things up in my case, is something like this:
-- Starting https://jamanetwork.com/rss/site_3/67.xml...
-- Starting https://www.b-i-t-online.de/bitrss.xml...
-- Starting http://twitrss.me/twitter_user_to_rss/?user=cochranecollab...
1938204 size for https://www.b-i-t-online.de/bitrss.xml
-- Starting http://twitrss.me/twitter_user_to_rss/?user=cochranecollab...
28337 size for http://twitrss.me/twitter_user_to_rss/?user=cochranecollab
28337 size for http://twitrss.me/twitter_user_to_rss/?user=cochranecollab
-- Starting https://jamanetwork.com/rss/site_3/67.xml...
38697 size for https://jamanetwork.com/rss/site_3/67.xml
-- Starting https://www.b-i-t-online.de/bitrss.xml...
28337 size for http://twitrss.me/twitter_user_to_rss/?user=cochranecollab
28337 size for http://twitrss.me/twitter_user_to_rss/?user=cochranecollab
1938204 size for https://www.b-i-t-online.de/bitrss.xml
38697 size for https://jamanetwork.com/rss/site_3/67.xml
FINISHED with 6 results from 6 tasks.
In my real world code I have hundreds of download tasks like this. It is usual that some of the downloads are finished before all of them are started.
Is there a way to handle this with asyncio?
Here is a minimal working example:
#!/usr/bin/env python3
import random
import urllib.request
import asyncio
from concurrent.futures import ThreadPoolExecutor

executor = ThreadPoolExecutor()
loop = asyncio.get_event_loop()
urls = ['https://www.b-i-t-online.de/bitrss.xml',
        'https://jamanetwork.com/rss/site_3/67.xml',
        'http://twitrss.me/twitter_user_to_rss/?user=cochranecollab']

async def parse_one_url(u):
    print('-- Starting {}...'.format(u))
    r = await loop.run_in_executor(executor,
                                   urllib.request.urlopen, u)
    r = '{} size for {}'.format(len(r.read()), u)
    print(r)

async def do_async_parsing():
    # asyncio.wait() needs tasks or futures; bare coroutines are rejected on Python 3.11+
    tasks = [
        asyncio.ensure_future(parse_one_url(u))
        for u in urls
    ]
    completed, pending = await asyncio.wait(tasks)
    results = [task.result() for task in completed]
    print('FINISHED with {} results from {} tasks.'
          .format(len(results), len(tasks)))

if __name__ == '__main__':
    # blow up the urls
    urls = urls * 2
    random.shuffle(urls)
    try:
        #loop.set_debug(True)
        loop.run_until_complete(do_async_parsing())
    finally:
        loop.close()
Side question: Isn't asyncio useless in my case? Wouldn't it be easier to just use multiple threads?
Quoting the question: "In my real world code I have hundreds of download tasks like this. It is usual that some of the downloads are finished before all of them are started."
Well, you did create all the downloads up front and instructed asyncio to launch them all using asyncio.wait. Just starting to execute a coroutine is almost free, so there is no reason for this part to be limited in any way. However, the tasks actually submitted to ThreadPoolExecutor are capped at the number of workers in the pool. The default depends on the Python version (5 times the number of CPUs on Python 3.5 to 3.7, min(32, os.cpu_count() + 4) since 3.8) and can be changed with max_workers. If the number of URLs exceeds the number of workers, you should get the desired behavior. (But to actually observe it, you need to move the logging prints into the function managed by the executor.)
Note that the synchronous call to r.read() must also reside inside the function run by the executor, otherwise it will block the entire event loop. The corrected portion of the code would look like this:
def urlopen(u):
    print('-- Starting {}...'.format(u))
    r = urllib.request.urlopen(u)  # blocking call
    content = r.read()             # another blocking call
    print('{} size for {}'.format(len(content), u))

async def parse_one_url(u):
    await loop.run_in_executor(executor, urlopen, u)
The above is, however, not idiomatic use of asyncio. Normally the idea is that you don't use threads at all, but call natively async code, for example using aiohttp. Then you get the benefits of asyncio, such as working cancellation and scalability to a large number of tasks. In that setup you would limit the number of concurrent tasks by trivially wrapping the retrieval in an asyncio.Semaphore.
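The semaphore wrapping could look roughly like this. To keep the sketch self-contained, `fetch` uses `asyncio.sleep` as a stand-in for a native async download (an aiohttp request in real code); `fetch`, `fetch_all` and the `stats` dictionary are illustrative names, and `stats` exists only to show that the limit holds.

```python
import asyncio


async def fetch(url, sem, stats):
    """Stand-in for a native async download of url."""
    async with sem:  # at most `limit` coroutines are inside this block at once
        stats['active'] += 1
        stats['peak'] = max(stats['peak'], stats['active'])
        await asyncio.sleep(0.01)  # placeholder for the real network I/O
        stats['active'] -= 1
    return url


async def fetch_all(urls, limit):
    sem = asyncio.Semaphore(limit)
    stats = {'active': 0, 'peak': 0}
    # All coroutines are created up front, but the semaphore gates the I/O.
    results = await asyncio.gather(*(fetch(u, sem, stats) for u in urls))
    return results, stats['peak']
```

Calling `asyncio.run(fetch_all(urls, limit=3))` returns the results in the order of `urls`, together with the highest number of downloads that were in flight at the same time.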
If your whole actual logic consists of synchronous calls, you don't need asyncio at all; you can submit the calls directly to the executor and use the concurrent.futures synchronization functions wait() and as_completed() to wait for the resulting futures to finish.
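A minimal sketch of that thread-only variant follows. Here `download` is a stand-in for the blocking work (urlopen plus read in the question), and `download_all` and `max_workers=5` are illustrative choices rather than anything from the original code.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download(url):
    """Stand-in for the blocking download; returns (url, size)."""
    return url, len(url)


def download_all(urls, max_workers=5):
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # At most max_workers downloads run at once; the rest wait in the queue.
        futures = [executor.submit(download, u) for u in urls]
        for future in as_completed(futures):  # yields each future as it finishes
            url, size = future.result()       # re-raises any worker exception
            results[url] = size
    return results
```

Because `as_completed` hands back futures in completion order, results can be processed while later downloads are still running or have not started yet.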
I have an application that lets me select whether to use threads or processes:
def _get_future(self, workers):
    if self.config == "threadpool":
        self.logger.debug("using thread pools")
        executor = ThreadPoolExecutor(max_workers=workers)
    else:
        self.logger.debug("using process pools")
        executor = ProcessPoolExecutor(max_workers=workers)
    return executor
Later I execute the code:
self.executor = self._get_future(workers)  # workers: pool size, assumed to be defined by the caller
for component in components:
    self.logger.debug("submitting {} to future ".format(component))
    self.future_components.append(
        self.executor.submit(self._send_component, component))
# Wait for all tasks to finish
while self.future_components:
    self.future_components.pop().result()
When I use processes, my application gets stuck: the _send_component method is never called. When I use threads, everything works fine.
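The problem is the imperative, state-sharing approach; this is a use case for a more functional style, where the submitted function does not depend on instance state.
self._send_component is a bound method of a class instance. Separate processes share no memory, so ProcessPoolExecutor has to pickle the callable and its arguments to send them to the worker process, and pickling a bound method means pickling the whole instance (self) with everything it holds, such as the logger and the executor. Threads share the instance directly, which is why the thread pool works.
The solution was to rewrite the code so that _send_component is a static method that receives everything it needs as arguments.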
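The difference can be seen with pickle alone, without starting any processes. This is an illustration only: `Sender`, `send_bound`, `send_static` and `can_pickle` are hypothetical names, and the lock stands in for whatever unpicklable state the real class holds.

```python
import pickle
import threading


class Sender:
    def __init__(self):
        # Stand-in for state that cannot be pickled (locks, sockets and
        # executors commonly fall into this group).
        self.lock = threading.Lock()

    def send_bound(self, component):
        return component.upper()

    @staticmethod
    def send_static(component):
        return component.upper()


def can_pickle(obj):
    """Return True if obj survives pickle.dumps, as a process pool requires."""
    try:
        pickle.dumps(obj)
        return True
    except Exception:
        return False


sender = Sender()
# Pickling the bound method drags `sender` (and its lock) along with it.
print('bound method picklable: ', can_pickle(sender.send_bound))
# The static method is an ordinary function that pickle stores by name.
print('static method picklable:', can_pickle(Sender.send_static))
```

A static method (or a module-level function) carries no instance with it, so only the `component` argument has to be picklable.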
I have a set of kafka-python consumers that consume from different Kafka topics continuously and in parallel.
My question is: how do I kick off the consumers in parallel from a single Python script?
And what is the best way to manage (start/stop/monitor) these consumers?
If I write, for example:
run.py
import consumer1, consumer2, consumer3
consumer1.start()
consumer2.start()
consumer3.start()
It just hangs on consumer1.start(), because that call never returns and keeps running.
You can have different threads for each consumer to consume messages in parallel. For example you can have:
import threading

consumer1_thread = threading.Thread(target=consumer1.start, args=())
consumer2_thread = threading.Thread(target=consumer2.start, args=())
consumer3_thread = threading.Thread(target=consumer3.start, args=())
consumer1_thread.start()
consumer2_thread.start()
consumer3_thread.start()
You can then see the logs of each thread in parallel and add some logic to stop an individual thread if required.
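One common way to stop an individual thread is to give each consumer its own `threading.Event` and have its loop check that flag. The sketch below uses a stand-in loop instead of real Kafka polling; `consume`, `start_consumers` and `stop_consumer` are hypothetical names. A real consumer would have to poll with a timeout so that it gets to check the flag regularly.

```python
import threading


def consume(name, stop_event, seen):
    """Stand-in for a consumer loop; a real one would poll Kafka here."""
    while not stop_event.is_set():
        seen.append(name)      # placeholder for handling one batch of messages
        stop_event.wait(0.01)  # sleep, but wake up early when asked to stop


def start_consumers(names):
    seen = []
    consumers = {}
    for name in names:
        stop_event = threading.Event()
        thread = threading.Thread(target=consume,
                                  args=(name, stop_event, seen),
                                  daemon=True)
        thread.start()
        consumers[name] = (thread, stop_event)
    return consumers, seen


def stop_consumer(consumers, name):
    thread, stop_event = consumers[name]
    stop_event.set()  # ask the loop to exit
    thread.join()     # wait until it actually has
```

Monitoring can then be as simple as calling `is_alive()` on each stored thread.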
I have an array of data to process and a handler that runs for a long time (1-2 minutes) and uses a lot of memory for its calculations.
raw = ['a', 'b', 'c']
def handler(r):
    # do something long
    ...
Since handler requires a lot of memory, I want to execute it in separate subprocess and kill it after execution to release memory. Something like the following snippet:
from multiprocessing import Process
for r in raw:
    process = Process(target=handler, args=(r,))  # args must be a tuple
    process.start()
The problem is that this approach immediately starts len(raw) processes, which is not good.
Also, the subprocesses don't need to exchange any data. They just need to run one after another.
Ideally, a few processes would run at the same time, with a new one started as soon as a running one finishes.
How could this be implemented (if it's even possible)?
To run your processes sequentially, just join each process within the loop:
from multiprocessing import Process
for r in raw:
    process = Process(target=handler, args=(r,))  # args must be a tuple
    process.start()
    process.join()
That way you're sure that only one process is running at a time (no concurrency).
That's the simplest way. To run more than one process but limit how many run at the same time, you can use a multiprocessing.Pool object and apply_async.
I've built a simple example which computes the square of the argument and simulates heavy processing:
from multiprocessing import Pool
import time

def target(r):
    time.sleep(5)
    return r * r

raw = [1, 2, 3, 4, 5]

if __name__ == '__main__':
    with Pool(3) as p:  # 3 processes at a time
        reslist = [p.apply_async(target, (r,)) for r in raw]
        for result in reslist:
            print(result.get())
Running this I get:
<5 seconds wait, time to compute the results>
1
4
9
<5 seconds wait, 3 processes max can run at the same time>
16
25