How to run Python subprocess consistently and in parallel? - multithreading

I wrote a Python script that converts mp3 to wav with mpg123. Then ffmpeg takes the output of mpg123 and upsamples it. Finally, the ffmpeg output file is uploaded to the cloud. All these steps must run sequentially, one after another:
subprocess.run(['mpg123', ...])
subprocess.run(['ffmpeg', ...])
upload()
Suppose I have many mp3 files and I'd like to run 10 threads at the same time. I know that Python offers concurrency through subprocess.Popen, the threading module, concurrent.futures and multiprocessing. What is the right way to parallelize this process?

You could use the MPipe library:
from mpipe import OrderedStage, Pipeline

def increment(value):
    return value + 1

def double(value):
    return value * 2

stage1 = OrderedStage(increment, 3)
stage2 = OrderedStage(double, 3)
pipe = Pipeline(stage1.link(stage2))

for number in range(10):
    pipe.put(number)
pipe.put(None)

for result in pipe.results():
    print(result)
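If you'd rather not add a dependency, here is a minimal sketch using concurrent.futures from the standard library. It assumes a list called mp3_files and the upload() helper from the question; the exact mpg123 and ffmpeg arguments are placeholders, not the ones from your script.
import subprocess
from concurrent.futures import ThreadPoolExecutor

def convert_and_upload(mp3_path):
    # The three steps still run sequentially for each file.
    wav_path = mp3_path + '.wav'        # hypothetical intermediate name
    out_path = mp3_path + '.48k.wav'    # hypothetical upsampled name
    subprocess.run(['mpg123', '-w', wav_path, mp3_path], check=True)
    subprocess.run(['ffmpeg', '-y', '-i', wav_path, '-ar', '48000', out_path], check=True)
    upload(out_path)                    # upload() as in the question

# Threads are enough here: the heavy work happens in the external
# mpg123/ffmpeg processes, so the GIL is not a bottleneck.
with ThreadPoolExecutor(max_workers=10) as executor:
    list(executor.map(convert_and_upload, mp3_files))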

Related

v2ray (vmess|vless|trojan|ss) test in python

Friends, how can I test (vmess|vless|trojan|ss) configurations with Python?
I need a function to test the speed of given v2ray configs.
There is a vmessping project by v2fly that supports only vmess, and there is LiteSpeedTest for trojan/ss.
sample:
from subprocess import Popen, PIPE

def speedtest(vmesslink):
    process = Popen(["./vmessspeed", vmesslink], stdout=PIPE)
    stdout = process.communicate()[0]
    return stdout
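To check several configs at once, a hedged sketch reusing the speedtest() function above could look like this; the list of links is just a placeholder:
from concurrent.futures import ThreadPoolExecutor

links = ["vmess://...", "trojan://..."]  # placeholder config links

# Each speedtest() call mostly waits on the external process,
# so a thread pool is sufficient.
with ThreadPoolExecutor(max_workers=5) as executor:
    for link, output in zip(links, executor.map(speedtest, links)):
        print(link, output)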

How to implement Multiprocessing in Azure Databricks - Python

I need to get details of each file from a directory. It is taking a long time, so I need to implement multiprocessing so that execution can finish sooner.
My code is like this:
from datetime import datetime
from pathlib import Path
from os.path import getmtime, getsize
from multiprocessing import Pool, Process

def iterate_directories(root_dir):
    for child in Path(root_dir).iterdir():
        if child.is_file():
            modified_time = datetime.fromtimestamp(getmtime(child)).date()
            file_size = getsize(child)
            # further steps...
        else:
            iterate_directories(child)  ## I need this to run on a separate Process (in parallel)
I tried to do the recursive call using the code below, but it is not working. It comes out of the loop immediately.
        else:
            p = Process(target=iterate_directories, args=(child,))
            Pros.append(p)  # declared Pros as an empty list
            p.start()

for p in Pros:
    if not p.is_alive():
        p.join()
What am I missing here? How can I process the sub-directories in parallel?
You have to get the list of directories first and then use a multiprocessing pool to call the function.
Something like below:
from datetime import datetime
from pathlib import Path
from os.path import getmtime, getsize
from multiprocessing import Pool, Process

def iterate_directories(root_dir):
    Filedetails = ''
    for child in Path(root_dir).iterdir():
        if child.is_file():
            modified_time = datetime.fromtimestamp(getmtime(child)).date()
            file_size = getsize(child)
            Filedetails = Filedetails + '\n' + '{add file name details}' + ' ' + str(modified_time) + ' ' + str(file_size)
        else:
            iterate_directories(child)  ## I need this to run on a separate Process (in parallel)
    return Filedetails  # file details returned from that particular directory

pool = Pool(processes=...)  # define how many processes you would like to run in parallel
results = pool.map(iterate_directories, [...])  # pass an explicit list of directories here
print(results)  # the entire collection is printed here; it is a list you can iterate per directory
Please let me know how it goes.
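For reference, a hedged, concrete sketch of that approach is below; the process count of 4 and the root directory are assumptions, and each worker recurses within its own directory via Path.rglob instead of calling itself:
from datetime import datetime
from multiprocessing import Pool
from os.path import getmtime, getsize
from pathlib import Path

def collect_file_details(root_dir):
    # Gather (path, modified date, size) for every file under one directory.
    details = []
    for child in Path(root_dir).rglob('*'):
        if child.is_file():
            details.append((str(child),
                            datetime.fromtimestamp(getmtime(child)).date(),
                            getsize(child)))
    return details

if __name__ == '__main__':
    root = '/dbfs/mnt/data'  # assumed root directory
    top_level_dirs = [str(p) for p in Path(root).iterdir() if p.is_dir()]
    with Pool(processes=4) as pool:  # assumed process count
        results = pool.map(collect_file_details, top_level_dirs)
    for per_dir in results:
        print(per_dir)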
The problem is this line:
if not p.is_alive():
What this translates to is: only wait for the process if it has already finished, which obviously does not make much sense (you would need to remove the not from the condition). The check is also completely unnecessary: .join() internally does the same check p.is_alive() does (except that it blocks). So you can safely just do this:
for p in Pros:
    p.join()
The code will then wait for all child processes to finish.

Need to do CPU bound processing using 2+ processes in Python by reading from a gzipped file

I have a gzipped file (compressed 10 GB, uncompressed 100 GB) that contains reports separated by demarcation lines, and I have to parse it.
Parsing and processing the data take a long time, so this is a CPU-bound problem (not an IO-bound problem). I am therefore planning to split the work across multiple processes using the multiprocessing module. The problem is that I am unable to send/share data with the child processes efficiently. I am using subprocess.Popen to stream the uncompressed data into the parent process:
process = subprocess.Popen('gunzip --keep --stdout big-file.gz',
                           shell=True,
                           stdout=subprocess.PIPE)
I am thinking of using a Lock() so that child-process-1 reads/parses one report and releases the lock, then child-process-2 reads/parses the next report, and so on, alternating between them. However, when I pass process.stdout as an argument to the child processes, I get a pickling error.
I have tried creating a multiprocessing.Queue() and a multiprocessing.Pipe() to send data to the child processes, but this is way too slow (in fact it is way slower than doing it in a single thread, i.e. serially).
Any thoughts/examples about sending data to child processes efficiently will help.
Could you try something simple instead? Have each worker process run its own instance of gunzip, with no interprocess communication at all. Worker 1 can process the first report and just skip over the second. The opposite for worker 2. Each worker skips every other report. Then an obvious generalization to N workers.
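A minimal sketch of that idea, assuming the demarcation is a single known separator line; the file name, delimiter and parse_report() body are placeholders:
import subprocess
from multiprocessing import Pool

NWORKERS = 4                 # assumed worker count
DELIM = "DEMARCATION\n"      # assumed separator line between reports

def parse_report(lines):
    # Placeholder for the real CPU-bound parsing of one report.
    return len(lines)

def worker(worker_id):
    # Each worker decompresses the whole file itself and only parses
    # every NWORKERS-th report, skipping the rest.
    proc = subprocess.Popen(['gunzip', '--keep', '--stdout', 'big-file.gz'],
                            stdout=subprocess.PIPE, text=True)
    total, report, index = 0, [], 0
    for line in proc.stdout:
        if line == DELIM:
            if index % NWORKERS == worker_id:
                total += parse_report(report)
            report, index = [], index + 1
        else:
            report.append(line)
    if report and index % NWORKERS == worker_id:
        total += parse_report(report)
    proc.stdout.close()
    proc.wait()
    return total

if __name__ == '__main__':
    with Pool(NWORKERS) as pool:
        print(sum(pool.map(worker, range(NWORKERS))))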
Or not ...
I think you'll need to be more specific about what you tried, and perhaps give more info about your problem (like: how many records are there? how big are they?).
Here's a program ("genints.py") that prints a bunch of random ints, one per line, broken into groups via "xxxxx\n" separator lines:
from random import randrange, seed

seed(42)
for i in range(1000):
    for j in range(randrange(1, 1000)):
        print(randrange(100))
    print("xxxxx")
Because it forces the seed, it generates the same stuff every time. Now a program to process those groups, both in parallel and serially, via the most obvious way I first thought of. crunch() takes time quadratic in the number of ints in a group, so it's quite CPU-bound. The output from one run, using (as shown) 3 worker processes for the parallel part:
parallel result: 10,901,000,334 0:00:35.559782
serial result: 10,901,000,334 0:01:38.719993
So the parallelized run took about one-third the time. In what relevant way(s) does that differ from your problem? Certainly, a full run of "genints.py" produces less than 2 million bytes of output, so that's a major difference - but it's impossible to guess from here whether it's a relevant difference. Perhaps, e.g., your problem is only very mildly CPU-bound? It's obvious from the output here that the overhead of passing chunks of stdout to worker processes is all but insignificant in this program.
In short, you probably need to give people - as I just did for you - a complete program they can run that reproduces your problem.
import multiprocessing as mp

NWORKERS = 3
DELIM = "xxxxx\n"

def runjob():
    import subprocess
    # 'py' is just a shell script on my box that
    # invokes the desired version of Python -
    # which happened to be 3.8.5 for this run.
    p = subprocess.Popen("py genints.py",
                         shell=True,
                         text=True,
                         stdout=subprocess.PIPE)
    return p.stdout

# Return list of lines up to (but not including) next DELIM,
# or EOF. If the file is already exhausted, return None.
def getrecord(f):
    result = []
    foundone = False
    for line in f:
        foundone = True
        if line == DELIM:
            break
        result.append(line)
    return result if foundone else None

def crunch(rec):
    total = 0
    for a in rec:
        for b in rec:
            total += abs(int(a) - int(b))
    return total

if __name__ == "__main__":
    import datetime
    now = datetime.datetime.now

    s = now()
    total = 0
    f = runjob()
    with mp.Pool(NWORKERS) as pool:
        for i in pool.imap_unordered(crunch,
                                     iter((lambda: getrecord(f)), None)):
            total += i
    f.close()
    print(f"parallel result: {total:,}", now() - s)

    s = now()
    # try the same thing serially
    total = 0
    f = runjob()
    while True:
        rec = getrecord(f)
        if rec is None:
            break
        total += crunch(rec)
    f.close()
    print(f"serial result: {total:,}", now() - s)

python speedup a simple function

I'm trying to find a simple way to "speed up" simple functions in a big script, so I googled it and found 3 ways to do that.
But the time they need is always the same.
So what am I doing wrong when testing them?
file1:
from concurrent.futures import ThreadPoolExecutor as PoolExecutor
from threading import Thread
import time
import os
import math
#https://dev.to/rhymes/how-to-make-python-code-concurrent-with-3-lines-of-code-2fpe
def benchmark():
    start = time.time()
    for i in range(0, 40000000):
        x = math.sqrt(i)
        print(x)
    end = time.time()
    print('time', end - start)

with PoolExecutor(max_workers=3) as executor:
    for _ in executor.map((benchmark())):
        pass
file2:
#the basic way
from threading import Thread
import time
import os
import math
def calc():
    start = time.time()
    for i in range(0, 40000000):
        x = math.sqrt(i)
        print(x)
    end = time.time()
    print('time', end - start)

calc()
file3:
import asyncio
import uvloop
import time
import math
#https://github.com/magicstack/uvloop
async def main():
    start = time.time()
    for i in range(0, 40000000):
        x = math.sqrt(i)
        print(x)
    end = time.time()
    print('time', end - start)

uvloop.install()
asyncio.run(main())
Every file needs about 180-200 seconds, so I "can't see" a difference.
I googled for it and found 3 ways to [speed up a function], but it seems the time they need is always the same. So what am I doing wrong when testing them?
You seem to have found strategies to speed up some code by parallelizing it, but you failed to implement them correctly. First, the speedup is supposed to come from running multiple instances of the function in parallel, and the code snippets make no attempt to do that. Then there are other problems.
In the first example, you pass the result of benchmark() to executor.map, which means benchmark() is immediately executed to completion, thus effectively disabling parallelization. (Also, executor.map is supposed to receive an iterable, not None, so this code must have printed a traceback not shown in the question.) The correct way would be something like:
# run the benchmark 5 times in parallel - if that takes less
# than 5x of a single benchmark, you've got a speedup
with ThreadPoolExecutor(max_workers=5) as executor:
    for _ in range(5):
        executor.submit(benchmark)
For this to actually produce a speedup, you should try to use ProcessPoolExecutor, which runs its tasks in separate processes and is therefore unaffected by the GIL.
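For example, here is a hedged sketch of the ProcessPoolExecutor variant; the print(x) inside the loop is dropped here because console output would otherwise dominate the timing:
import math
import time
from concurrent.futures import ProcessPoolExecutor

def benchmark(n=40000000):
    start = time.time()
    for i in range(n):
        math.sqrt(i)
    return time.time() - start

if __name__ == '__main__':
    start = time.time()
    # run the benchmark 5 times in separate processes
    with ProcessPoolExecutor(max_workers=5) as executor:
        single_times = list(executor.map(benchmark, [40000000] * 5))
    total = time.time() - start
    # with real parallelism, total should be far less than sum(single_times)
    print('single runs:', single_times)
    print('wall-clock total:', total)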
The second code snippet never actually creates or runs a thread; it just executes the function in the main thread, so it's unclear how that is supposed to speed things up.
The last snippet doesn't await anything, so the async def works just like an ordinary function. Note that asyncio is an async framework based on switching between tasks that are blocked on IO, and as such it can never speed up CPU-bound calculations.

how to use wxpython threading to prevent blocking main loop

I'm working on a school project to develop a customized media player on the Python platform. The problem is that when I use time.sleep(duration), it blocks the main loop of my GUI, preventing it from updating. I've consulted my supervisor and was told to use multi-threading, but I have no idea how to use threading. Would anyone advise me on how to implement threading in my scenario below?
Code:
def load_playlist(self, event):
    playlist = ["D:\Videos\test1.mp4", "D:\Videos\test2.avi"]
    for path in playlist:
        #calculate each media file duration
        ffmpeg_command = ['C:\\MPlayer-rtm-svn-31170\\ffmpeg.exe', '-i', path]
        pipe = subprocess.Popen(ffmpeg_command, shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
        results = pipe.communicate()

        #Regular expression to get the duration
        length_regexp = 'Duration: (\d{2}):(\d{2}):(\d{2})\.\d+,'
        re_length = re.compile(length_regexp)

        # find the matches using the regexp to compare with the buffer/string
        matches = re_length.search(str(results))
        #print matches
        hour = matches.group(1)
        minute = matches.group(2)
        second = matches.group(3)

        #Converting to seconds
        hour_to_second = int(hour) * 60 * 60
        minute_to_second = int(minute) * 60
        second_to_second = int(second)
        num_second = hour_to_second + minute_to_second + second_to_second
        print num_second

        #Play the media file
        trackPath = '"%s"' % path.replace("\\", "/")
        self.mplayer.Loadfile(trackPath)

        #Sleep for the duration of second(s) for the video before jumping to another video
        time.sleep(num_second) #THIS IS THE PROBLEM#
You'll probably want to take a look at the wxPython wiki which has several examples of using threads, Queues and other fun things:
http://wiki.wxpython.org/LongRunningTasks
http://wiki.wxpython.org/Non-Blocking%20Gui
I also wrote a tutorial on the subject here: http://www.blog.pythonlibrary.org/2010/05/22/wxpython-and-threads/
The main thing to keep in mind is that when you use threads, you cannot call your wx methods directly (i.e. myWidget.SetValue, etc). Instead, you need to use one of the wxPython threadsafe methods: wx.CallAfter, wx.CallLater or wx.PostEvent
You would start a new thread like any other multithreading example:
from threading import Thread
# in caller code, start a new thread
Thread(target=load_playlist).start()
However, you have to make sure that any calls into wx go through proper inter-thread communication. You cannot just call wx code from this new thread; it will segfault. So, use wx.CallAfter:
# in load_playlist, you have to synchronize your wx calls
wx.CallAfter(self.mplayer.Loadfile, trackPath)
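Putting the two pieces together, a hedged sketch of the pattern might look like the following; self.mplayer comes from the question, while get_duration() stands in for the ffmpeg/regex code above and is hypothetical:
import time
import wx
from threading import Thread

class PlayerPanel(wx.Panel):  # hypothetical panel that owns self.mplayer
    def on_load_playlist(self, event):
        # Kick the long-running work off the GUI thread so MainLoop keeps running.
        Thread(target=self.play_playlist).start()

    def play_playlist(self):
        playlist = ["D:/Videos/test1.mp4", "D:/Videos/test2.avi"]
        for path in playlist:
            num_second = self.get_duration(path)  # hypothetical: the ffmpeg/regex code above
            # GUI calls must be marshalled back to the main thread.
            wx.CallAfter(self.mplayer.Loadfile, path)
            # Sleeping here only blocks this worker thread, not the GUI.
            time.sleep(num_second)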
