Restrict number of subprocess.Popen

Following on from "run multiple instances of python script simultaneously", I can now write a Python program that runs multiple instances of a script.
import sys
import subprocess

for i in range(1000):
    subprocess.Popen([sys.executable, 'task.py', '{}in.csv'.format(i), '{}out.csv'.format(i)])
This starts 1000 subprocesses simultaneously. If the command each subprocess runs is computationally resource-intensive, this can put a heavy load on the machine (and may even crash it).
Is there a way to restrict the number of subprocesses running at a time? For example, something like this:
if (#subprocess_currently_running == 10) {
    wait(); // or sleep
}
That is, allow only 10 subprocesses to run at a time; as soon as one of the ten finishes, start a new one.

A counting semaphore is a classic mechanism for controlling the maximum number of concurrently running threads or processes.
But since each subprocess.Popen object (i.e. each process) needs to be waited on until it terminates, the official docs point out an important downside of subprocess.Popen.wait() for this case of multiple concurrent subprocesses:
Note: The function is implemented using a busy loop (non-blocking call and short sleeps). Use the asyncio module for an asynchronous
wait: see asyncio.create_subprocess_exec.
Thus, for this case it's preferable to switch to:
asyncio.create_subprocess_exec
asyncio.Semaphore
How it can be implemented:
import asyncio
import sys

MAX_PROCESSES = 10

async def process_csv(i, sem):
    async with sem:  # allows at most MAX_PROCESSES subprocesses to run concurrently
        proc = await asyncio.create_subprocess_exec(sys.executable, 'task.py',
                                                    f'{i}in.csv', f'{i}out.csv')
        await proc.wait()

async def main():
    sem = asyncio.Semaphore(MAX_PROCESSES)
    await asyncio.gather(*[process_csv(i, sem) for i in range(1000)])

asyncio.run(main())
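For comparison, the same limit can be enforced without asyncio by blocking in worker threads. This is a minimal sketch, not part of the answer above; it reuses the task.py naming from the question:

import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

def run_task(i):
    # each worker thread launches one child process and blocks until it exits,
    # so at most 10 subprocesses are alive at any moment
    subprocess.run([sys.executable, 'task.py', f'{i}in.csv', f'{i}out.csv'])

with ThreadPoolExecutor(max_workers=10) as pool:
    # consuming the iterator surfaces any exceptions raised in run_task
    list(pool.map(run_task, range(1000)))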

Related

Why does my process list show multiple threads when running aiohttp?

I'm currently using aiohttp in one of my projects which uses asyncio. After searching for reasons why I'm getting a high amount of memory usage, I noticed that aiohttp seems to create threads in the background.
I have reduced my code to the minimal example below, which shows my problem.
import asyncio
import aiohttp
from aiohttp import ClientSession

async def test1(link, session):
    async with session.get(
        link,
    ) as r:
        print(r.status)
        await asyncio.sleep(10)

async def test():
    async with ClientSession(
        cookie_jar=aiohttp.DummyCookieJar(),
    ) as session:
        await asyncio.gather(test1("https://google.com", session))

loop = asyncio.get_event_loop()
loop.run_until_complete(test())
loop.close()
When running this with ps -e -T |grep python3 I get the following output, which is odd because it looks like it created a thread:
160304 160304 pts/5 00:00:00 python3
160304 160306 pts/5 00:00:00 python3
If I change the asyncio.gather to use one more test1 function and run the ps command again I get three threads instead:
160414 160414 pts/5 00:00:00 python3
160414 160416 pts/5 00:00:00 python3
160414 160417 pts/5 00:00:00 python3
This looks very problematic, because my assumption was that aiohttp runs its event loop in a single thread; that is why I have used a ThreadPoolExecutor to launch a specified number of threads at the start of the program. If aiohttp creates a new thread for every session.get request, then the number of threads could be as high as X specified threads * the number of currently running HTTP requests.
For more context I'm using:
Python 3.8.10
Ubuntu 20.04.3 LTS
The purpose of my main program is to save the HTML of X domains as quickly as possible. The current architecture uses a ThreadPoolExecutor to spin up Y threads that are reused throughout the application's life; every thread then sends Z HTTP requests simultaneously using session.get and asyncio.gather. Is this the wrong approach, and should I use another Python library instead of aiohttp? Is threading in combination with event loops redundant?
I have searched around on the web and I have not found an answer to this question, so I'm humbly asking the community for any smart input.
asyncio always has a default thread pool under the hood with up to min(32, (os.cpu_count() or 1) + 4) worker threads.
The pool is used by asyncio internally for DNS lookups.
Moreover, even if you set up aiohttp to use aiodns for DNS resolution, the default asyncio pool still exists (although it does nothing).
In turn, aiohttp uses the default thread pool for some operations, mostly for local file handling.
For example, await session.post(url, data=open('filename', 'rb')) reads the file chunks for sending in those threads; this helps to avoid long blocking calls.
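If the extra threads are a concern, asyncio's default executor can be shrunk and DNS resolution handed to aiodns. This is a minimal sketch, not taken from the answer above; the executor size and the use of AsyncResolver (which requires the aiodns package) are assumptions:

import asyncio
from concurrent.futures import ThreadPoolExecutor

import aiohttp

async def main():
    loop = asyncio.get_running_loop()
    # replace the default executor that asyncio would otherwise create lazily
    loop.set_default_executor(ThreadPoolExecutor(max_workers=2))

    # resolve DNS via aiodns instead of the default thread pool
    connector = aiohttp.TCPConnector(resolver=aiohttp.AsyncResolver())
    async with aiohttp.ClientSession(connector=connector) as session:
        async with session.get("https://example.com") as r:
            print(r.status)

asyncio.run(main())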

Python: running many subprocesses from different threads is slow

I have a program with 1 process that starts a lot of threads.
Each thread might use subprocess.Popen to run some command.
I see that the time to run the command increases with the number of threads.
Example:
>>> def foo():
...     s = time.time()
...     subprocess.Popen('ip link show'.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE, universal_newlines=True).communicate()
...     print(time.time() - s)
...
>>> foo()
0.028950929641723633
>>> [threading.Thread(target=foo).start() for _ in range(10)]
0.058995723724365234
0.07323050498962402
0.09158825874328613
0.11541390419006348 # !!!
0.08147192001342773
0.05238771438598633
0.0950784683227539
0.10175108909606934 # !!!
0.09703755378723145
0.06497764587402344
Is there another way of executing a lot of commands from a single process in parallel that doesn't degrade performance?
Python's threads are, of course, concurrent, but they do not really run in parallel because of the GIL. Therefore, they are not suitable for CPU-bound applications. If you need to truly parallelize something and allow it to run on all CPU cores, you will need to use multiple processes. Here is a nice answer discussing this in more detail: What are the differences between the threading and multiprocessing modules?.
For the above example, multiprocessing.pool may be a good choice (note that there is also a ThreadPool available in this module).
from multiprocessing.pool import Pool
import subprocess
import time

def foo(*args):
    s = time.time()
    subprocess.Popen('ip link show'.split(), stdout=subprocess.PIPE, stderr=subprocess.PIPE, universal_newlines=True).communicate()
    return time.time() - s

if __name__ == "__main__":
    with Pool(10) as p:
        result = p.map(foo, range(10))
        print(result)
        # [0.018695592880249023, 0.009021520614624023, 0.01150059700012207, 0.02113938331604004, 0.014114856719970703, 0.01342153549194336, 0.011168956756591797, 0.014746427536010742, 0.013572454452514648, 0.008752584457397461]
        result = p.map_async(foo, range(10))
        print(result.get())
        # [0.00636744499206543, 0.011589527130126953, 0.010645389556884766, 0.0070612430572509766, 0.013571739196777344, 0.009610414505004883, 0.007040739059448242, 0.010993719100952148, 0.012415409088134766, 0.0070383548736572266]
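The ThreadPool mentioned above has the same interface, which makes it easy to compare. This is a minimal sketch reusing foo from the example; threads are sufficient here because foo mostly waits on the child process rather than doing CPU work:

from multiprocessing.pool import ThreadPool

if __name__ == "__main__":
    # 10 threads, each waiting on its own subprocess
    with ThreadPool(10) as tp:
        print(tp.map(foo, range(10)))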
However, if your function is similar to the example in that it mostly just launches other processes and doesn't do a lot of calculation, I doubt parallelizing it will make much of a difference, because the subprocesses can already run in parallel. Perhaps the slowdown occurs because your whole system gets overwhelmed for a moment by all those processes (CPU usage could be high, or too many disk reads/writes may be attempted within a short time). I would suggest taking a close look at system resources (Task Manager etc.) while running the program.
Maybe it has nothing to do with Python: opening a new shell = opening a new file, since basically everything is a file on Linux.
Take a look at your limit for open files with this command (the default is 1024):
ulimit -n
and try to raise it with this command to see if your code gets faster:
ulimit -n 2048

Do python-rq workers support multiprocessing module?

I currently have multiple python-rq workers executing jobs from a queue in parallel. Each job also utilizes the python multiprocessing module.
Job execution code is simply this:
from redis import Redis
from rq import Queue
q = Queue('calculate', connection=Redis())
job = q.enqueue(calculateJob, someArgs)
And calculateJob is defined as such:
import multiprocessing as mp
from functools import partial

def calculateJob(someArgs):
    pool = mp.Pool()
    # presumably the partial is then mapped over the pool's worker processes
    result = partial(someFunc, someArgs=someArgs)

def someFunc(someArgs):
    # do something
    return output
So presumably when a job is being processed, all cores are automatically being utilized by that job. How does another worker processing another job in parallel execute its job if the first job is utilizing all cores already?
It depends on how your system handles processes, just like opening a video plus 5 more processes doesn't completely freeze your 6-core computer. Each worker is a new process (a fork of a process, really). Instead of doing multiprocessing inside a job, you can put each piece of work on the queue as its own job and let rq handle the parallelism by spawning multiple workers.
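A minimal sketch of that idea, reusing someFunc and someArgs from the question (splitting the work per item is an assumption):

from redis import Redis
from rq import Queue

q = Queue('calculate', connection=Redis())

# one rq job per work item instead of a multiprocessing.Pool inside a single job;
# parallelism comes from running several workers, e.g. `rq worker calculate` in N shells
jobs = [q.enqueue(someFunc, arg) for arg in someArgs]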

python3 multiprocessing.Pool with maxtasksperchild=1 does not terminate

When using multiprocessing.Pool in python 3.6 or 3.7 with maxtasksperchild=1, I noticed that some processes spawned by the pool are hanging and do not quit, even though the callback to their tasks was already executed. As a result, Pool.join() will block forever, even though all tasks are finished. In the process tree, running but idle child processes can be seen. The problem does not occur if maxtasksperchild=None.
The problem seems to be related to what the callback precisely does. The docs point out that the callback "should return immediately", as it will block other threads managing the pool.
A minimal example to reproduce this behavior on my machine is as follows: (Give it a few tries or increase the number of tasks when it does not block forever.)
from multiprocessing import Pool
from os import getpid
from random import random
from time import sleep

def do_stuff():
    pass

def cb(arg):
    sleep(random())  # can be replaced with print('foo')

p = Pool(maxtasksperchild=1)
number_of_tasks = 100  # the value may depend on your machine -- for mine 20 is sufficient to trigger the behavior

for i in range(number_of_tasks):
    p.apply_async(do_stuff, callback=cb)

p.close()
print("joining ... (this should take just seconds)")
print("use the following command to watch the process tree:")
print("    watch -n .2 pstree -at -p %i" % getpid())
p.join()
Contrary to what I expected, p.join() in the last line will block forever even though do_stuff and cb were both called 100 times.
I am aware that sleep(random()) is in violation of the docs, but is print() also taking 'too long'? The way the docs are written suggests that a non-blocking callback function is required for performance and efficiency, and they do not make clear that a 'slow' callback function will break the pool entirely.
Is print() forbidden in any multiprocessing.Pool callback function? (How to replace that functionality? What is "returning immediately", what is not?)
If yes, should the python documentation be updated to make this clear?
If yes, is it good python practice to rely on "fast" execution of python threads? Does this violate the rule that one should not make assumptions on execution order of threads?
Should I report this to the python bug tracker?

Python 3: create new process when another one finishes

I have an array of data to handle, and a handler that takes a long time to execute (1-2 minutes) and uses a lot of memory for its calculations.
raw = ['a', 'b', 'c']

def handler(r):
    # do something long
    pass
Since handler requires a lot of memory, I want to execute it in a separate subprocess and kill it after execution to release the memory. Something like the following snippet:
from multiprocessing import Process

for r in raw:
    process = Process(target=handler, args=(r,))
    process.start()
The problem is that this approach immediately starts len(raw) processes, and that's not good.
Also, the subprocesses don't need to exchange any kind of data; they just need to run one after another.
Therefore it would be great to run a few processes at the same time and start a new one once an existing one finishes.
How could it be implemented (if it's even possible)?
To run your processes sequentially, just join each process within the loop:
from multiprocessing import Process

for r in raw:
    process = Process(target=handler, args=(r,))
    process.start()
    process.join()
That way you're sure that only one process is running at a time (no concurrency).
That's the simplest way. To run more than one process but limit the number of processes running at the same time, you can use a multiprocessing.Pool object and apply_async.
I've built a simple example which computes the square of its argument and simulates heavy processing:
from multiprocessing import Pool
import time

def target(r):
    time.sleep(5)
    return r * r

raw = [1, 2, 3, 4, 5]

if __name__ == '__main__':
    with Pool(3) as p:  # 3 processes at a time
        reslist = [p.apply_async(target, (r,)) for r in raw]
        for result in reslist:
            print(result.get())
Running this I get:
<5 seconds wait, time to compute the results>
1
4
9
<5 seconds wait, 3 processes max can run at the same time>
16
25
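The same cap on concurrent processes can also be expressed with concurrent.futures. This is a minimal sketch (not from the answer above) reusing target and raw from the example; max_workers plays the role of the Pool size:

from concurrent.futures import ProcessPoolExecutor

if __name__ == '__main__':
    # at most 3 worker processes are alive at a time; a new task starts
    # as soon as a worker becomes free
    with ProcessPoolExecutor(max_workers=3) as executor:
        for result in executor.map(target, raw):
            print(result)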

Resources