Pickle object: multithreading safe?

I am using Python 3.5.1 with the threading module.
I have seen lots of questions about safely writing to a dictionary or a pickle file from several threads. In my case I want to read it, and the question is:
Can I safely load a pickle file several times at the same time?
Pseudo-Code:
import sys
import threading
import pickle

def function_1(pickle_file, arg_blue):
    my_dic = pickle.load(open(pickle_file, "rb"))
    # process my_dic with arg_blue

def function_2(pickle_file, arg_red):
    my_dic = pickle.load(open(pickle_file, "rb"))
    # process my_dic with arg_red

def main(pickle_file, arg_blue, arg_red):
    # Use two threads to call function_1 and function_2 at the same time.
    # function_1 and function_2 will not exchange data. Is it better to use the multiprocessing module?
    # thread_blue runs function_1, thread_red runs function_2.
    # Each of them writes to a distinct output.
    thread_blue = threading.Thread(target=function_1, args=(pickle_file, arg_blue))
    thread_red = threading.Thread(target=function_2, args=(pickle_file, arg_red))
    thread_blue.start()
    thread_red.start()
    thread_blue.join()
    thread_red.join()

if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2], sys.argv[3])
Calling the script:
python3.5 my_script.py my_pickle_file.p blue red
Any suggestion or commentary will be highly appreciated!

Yes, reading a file from multiple threads or processes is safe, as long as each thread opens the file itself -- i.e. don't pass the same open handle to multiple threads; that's where you run into trouble.
Note that multithreading in Python may not actually help if you want to parallelise work, because of the global interpreter lock (GIL).
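A minimal sketch of that advice (the worker and its arguments are illustrative, not taken from the question):
import pickle
import threading

def worker(pickle_file, label):
    # each thread opens its own handle, so the reads cannot interfere
    with open(pickle_file, "rb") as f:
        my_dic = pickle.load(f)
    print(label, "loaded", len(my_dic), "entries")

if __name__ == "__main__":
    threads = [threading.Thread(target=worker, args=("my_pickle_file.p", colour))
               for colour in ("blue", "red")]
    for t in threads:
        t.start()
    for t in threads:
        t.join()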

Call the same subprocess Python function several times

I need to process-parallelize some computations that are done several times.
So the subprocess Python function has to stay alive between two calls.
In a perfect world I would need something like this:
class Computer:
    def __init__(self, x):
        self.x = x
        # Creation of quite heavy Python objects that cannot be pickled!

    def call(self, y):
        return self.x + y

process = Computer(4)  # NEED MAGIC HERE to keep "call" alive in a subprocess!
print(process.call(1))   # prints 5 (=4+1)
print(process.call(12))  # prints 16 (=4+12)
I can follow this answer and communicate via asyncio.subprocess.PIPE, but in my actual use case,
the call argument is a list of lists of integers,
the call answer is a list of strings.
Thus it would be nice to avoid serializing/deserializing the arguments and return values by hand.
Any ideas on how to keep the function call "alive" and ready to receive new calls?
Here is an answer, based on this one, but:
several subprocesses are created,
each subprocess has its own identifier,
their calls are parallelized,
and a small layer allows the exchange of JSON instead of plain byte strings.
hello.py
#!/usr/bin/python3
# This is the task to be done.
# A task consists in receiving a json assumed to be
# {"vector": [...]}
# and returning a json with the length of the vector and
# the worker id.
import sys
import time
import json

ident = sys.argv[1]
while True:
    str_data = input()
    data = json.loads(str_data)
    command = data.get("command", None)
    if command == "quit":
        answer = {"comment": "I'm leaving",
                  "my id": ident}
        print(json.dumps(answer), end="\n")
        sys.exit(1)
    time.sleep(1)  # simulates 1 s of heavy work
    answer = {"size": len(data['vector']),
              "my id": ident}
    print(json.dumps(answer), end="\n")
main.py
#!/usr/bin/python3
import json
from subprocess import Popen, PIPE
import concurrent.futures
from concurrent.futures import ThreadPoolExecutor

dprint = print

def create_proc(arg):
    cmd = ["./hello.py", arg]
    process = Popen(cmd, stdin=PIPE, stdout=PIPE)
    return process

def make_call(proc, arg):
    """Make the call in a thread."""
    str_arg = json.dumps(arg)
    txt = bytes(str_arg + '\n', encoding='utf8')
    proc.stdin.write(txt)
    proc.stdin.flush()
    b_ans = proc.stdout.readline()
    s_ans = b_ans.decode('utf8')
    j_ans = json.loads(s_ans)
    return j_ans

def search(executor, procs, data):
    jobs = [executor.submit(make_call, proc, data) for proc in procs]
    answer = []
    for job in concurrent.futures.as_completed(jobs):
        got_ans = job.result()
        answer.append(got_ans)
    return answer

def main():
    n_workers = 50
    idents = [f"{i}st" for i in range(0, n_workers)]
    executor = ThreadPoolExecutor(n_workers)
    # Create `n_workers` subprocesses waiting for data to work with.
    # The subprocesses are all different because they receive a different
    # "initialization" id.
    procs = [create_proc(ident) for ident in idents]
    data = {"vector": [1, 2, 23]}
    answers = search(executor, procs, data)  # takes ~1 s instead of 50 s
    for answer in answers:
        print(answer)
    search(executor, procs, {"command": "quit"})

main()

How to confirm multiprocessing library is being used?

I am trying to use multiprocessing for the code below. The code seems to run a bit faster than the for loop inside the function.
How can I confirm that I am using the library and not just the for loop?
from multiprocessing import Pool
from multiprocessing import cpu_count
import requests
import pandas as pd

data = pd.read_csv('~/Downloads/50kNAE000.txt.1', sep="\t", header=None)
data = data[0].str.strip("0 ")
lst = []

def request(x):
    for i, v in x.items():
        print(i)
        file = requests.get(v)
        lst.append(file.text)
        #time.sleep(1)

if __name__ == "__main__":
    pool = Pool(cpu_count())
    results = pool.map(request(data))
    pool.close()  # 'TERM'
    pool.join()   # 'KILL'
Multiprocessing has overhead: it has to start the processes and transfer the function's data via an interprocess mechanism. Just running a single function in another process is always going to be slower than running that same function normally. The advantage comes from real parallelism, with enough work in each function to make the overhead insignificant.
You can print multiprocessing.current_process().name inside the worker function to see the process name change.
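For instance, a minimal sketch (the work function and pool size are illustrative, not the requests code from the question):
from multiprocessing import Pool, current_process

def work(x):
    # the name is something like ForkPoolWorker-1 (or SpawnPoolWorker-1 on Windows),
    # which shows the call ran in a pool worker rather than in the main process
    print(current_process().name, "processing", x)
    return x * x

if __name__ == "__main__":
    with Pool(4) as pool:
        print(pool.map(work, range(8)))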

How to transfer data between two separate scripts in Multiprocessing?

I am using multiprocessing to run two Python scripts in parallel. p1.py continually updates a certain variable, and the latest value of that variable should be displayed by p2.py every 2 seconds. The code for running the two scripts with multiprocessing is given below:
import os
from multiprocessing import Process

def script1():
    os.system("p1.py")

def script2():
    os.system("p2.py")

if __name__ == '__main__':
    p = Process(target=script1)
    q = Process(target=script2)
    p.start()
    q.start()
    p.join()
    q.join()
I am unable to transfer the value of the variable being updated by p1.py to p2.py. How should I approach the problem in a very simple way?
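One simple approach (a sketch with illustrative names, not code from this thread) is to turn the two scripts into functions and hand both of them a shared multiprocessing.Queue; children launched with os.system have no built-in channel for exchanging a variable:
import time
from multiprocessing import Process, Queue

def producer(q):                  # stands in for p1.py: keeps updating a value
    for value in range(1, 11):
        q.put(value)              # publish the latest value
        time.sleep(0.5)

def consumer(q):                  # stands in for p2.py: shows the latest value every 2 seconds
    latest = None
    for _ in range(5):
        time.sleep(2)
        while not q.empty():      # drain the queue, keeping only the most recent value
            latest = q.get()
        print("latest value:", latest)

if __name__ == '__main__':
    q = Queue()
    p = Process(target=producer, args=(q,))
    c = Process(target=consumer, args=(q,))
    p.start()
    c.start()
    p.join()
    c.join()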

Why python asyncio process in a thread seems unstable on Linux?

I am trying to run an asynchronous external command in Python 3 from a Qt application. Previously I was using a multiprocessing thread to do it without freezing the Qt application. Now I would like to do it with a QThread, so that I can pickle and pass a Qt window as an argument to some other functions (not presented here). I did this and tested it successfully on my Windows OS, but when I tried the application on my Linux OS I got the following error: RuntimeError: Cannot add child handler, the child watcher does not have a loop attached
From that point I tried to isolate the problem, and I obtained the minimal example below (as minimal as I could make it) that reproduces the problem.
Of course, as I mentioned before, if I replace the QThreadPool with a list of multiprocessing threads, this example works well. I also noticed something that astonished me: if I uncomment the line rc = subp([sys.executable,"./HelloWorld.py"]) in the last part of the example, it also works. I cannot explain why.
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

## IMPORTS ##
from functools import partial
from PyQt5 import QtCore
from PyQt5.QtCore import QThreadPool, QRunnable, QCoreApplication
import sys
import asyncio.subprocess

# Global variables
Qpool = QtCore.QThreadPool()

def subp(cmd_list):
    """ """
    if sys.platform.startswith('linux'):
        new_loop = asyncio.new_event_loop()
        asyncio.set_event_loop(new_loop)
    elif sys.platform.startswith('win'):
        new_loop = asyncio.ProactorEventLoop()  # for subprocess' pipes on Windows
        asyncio.set_event_loop(new_loop)
    else:
        print('[ERROR] OS not available for encoding... EXIT')
        sys.exit(2)
    rc, stdout, stderr = new_loop.run_until_complete(get_subp(cmd_list))
    new_loop.close()
    if rc != 0:
        print('Exit not zero ({}): {}'.format(rc, sys.exc_info()[0]))  # , exc_info=True
    return rc, stdout, stderr

async def get_subp(cmd_list):
    """ """
    print('subp: ' + ' '.join(cmd_list))
    # Create the subprocess, redirect the standard output into a pipe
    create = asyncio.create_subprocess_exec(*cmd_list,
                                            stdout=asyncio.subprocess.PIPE,
                                            stderr=asyncio.subprocess.PIPE)
    proc = await create
    # read child's stdout/stderr concurrently (capture and display)
    try:
        stdout, stderr = await asyncio.gather(
            read_stream_and_display(proc.stdout),
            read_stream_and_display(proc.stderr))
    except Exception:
        proc.kill()
        raise
    finally:
        rc = await proc.wait()
        print(" [Exit {}] ".format(rc) + ' '.join(cmd_list))
    return rc, stdout, stderr

async def read_stream_and_display(stream):
    """ """
    async for line in stream:
        print(line, flush=True)

class Qrun_from_job(QtCore.QRunnable):
    def __init__(self, job, arg):
        super(Qrun_from_job, self).__init__()
        self.job = job
        self.arg = arg

    def run(self):
        code = partial(self.job)
        code()

def ThdSomething(job, arg):
    testRunnable = Qrun_from_job(job, arg)
    Qpool.start(testRunnable)

def testThatThing():
    rc = subp([sys.executable, "./HelloWorld.py"])

if __name__ == '__main__':
    app = QCoreApplication([])
    # rc = subp([sys.executable,"./HelloWorld.py"])
    ThdSomething(testThatThing, 'tests')
    sys.exit(app.exec_())
with the HelloWorld.py file:
#!/usr/bin/env python3
import sys

if __name__ == '__main__':
    print('HelloWorld')
    sys.exit(0)
Therefore I have two questions: how can I make this example work properly with QThread? And why does a previous call of an asynchronous task (via the subp function) change the stability of the example on Linux?
EDIT
Following the advice of @user4815162342, I tried using run_coroutine_threadsafe with the code below. But it does not work and returns the same error, i.e. RuntimeError: Cannot add child handler, the child watcher does not have a loop attached. I also tried replacing the threading call with its equivalent from the multiprocessing module; with that one, the subp command is never launched.
The code:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

## IMPORTS ##
import sys
import asyncio.subprocess
import threading
import multiprocessing

# at top-level
loop = asyncio.new_event_loop()

def spin_loop():
    asyncio.set_event_loop(loop)
    loop.run_forever()

def subp(cmd_list):
    # submit the task to asyncio
    fut = asyncio.run_coroutine_threadsafe(get_subp(cmd_list), loop)
    # wait for the task to finish
    rc, stdout, stderr = fut.result()
    return rc, stdout, stderr

async def get_subp(cmd_list):
    """ """
    print('subp: ' + ' '.join(cmd_list))
    # Create the subprocess, redirect the standard output into a pipe
    proc = await asyncio.create_subprocess_exec(*cmd_list,
                                                stdout=asyncio.subprocess.PIPE,
                                                stderr=asyncio.subprocess.PIPE)
    # read child's stdout/stderr concurrently (capture and display)
    try:
        stdout, stderr = await asyncio.gather(
            read_stream_and_display(proc.stdout),
            read_stream_and_display(proc.stderr))
    except Exception:
        proc.kill()
        raise
    finally:
        rc = await proc.wait()
        print(" [Exit {}] ".format(rc) + ' '.join(cmd_list))
    return rc, stdout, stderr

async def read_stream_and_display(stream):
    """ """
    async for line in stream:
        print(line, flush=True)

if __name__ == '__main__':
    threading.Thread(target=spin_loop, daemon=True).start()
    # multiprocessing.Process(target=spin_loop, daemon=True).start()
    print('thread passed')
    rc = subp([sys.executable, "./HelloWorld.py"])
    print('end')
    sys.exit(0)
As a general design principle, it's unnecessary and wasteful to create new event loops only to run a single coroutine. Instead, create one event loop, run it in a separate thread, and use it for all your asyncio needs by submitting tasks to it with asyncio.run_coroutine_threadsafe.
For example:
# at top-level
loop = asyncio.new_event_loop()

def spin_loop():
    asyncio.set_event_loop(loop)
    loop.run_forever()

asyncio.get_child_watcher().attach_loop(loop)
threading.Thread(target=spin_loop, daemon=True).start()

# ... the rest of your code ...
# ... the rest of your code ...
With this in place, you can easily execute any asyncio code from any thread whatsoever using the following:
def subp(cmd_list):
    # submit the task to asyncio
    fut = asyncio.run_coroutine_threadsafe(get_subp(cmd_list), loop)
    # wait for the task to finish
    rc, stdout, stderr = fut.result()
    return rc, stdout, stderr
Note that you can use add_done_callback to be notified when the future returned by asyncio.run_coroutine_threadsafe finishes, so you might not need a thread in the first place.
Note that all interaction with the event loop should go either through the afore-mentioned run_coroutine_threadsafe (when submitting coroutines) or through loop.call_soon_threadsafe when you need the event loop to call an ordinary function. For example, to stop the event loop, you would invoke loop.call_soon_threadsafe(loop.stop).
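For illustration, a small self-contained sketch (the work coroutine is just a stand-in) that submits a coroutine to the background loop and reacts to its completion with add_done_callback instead of blocking on fut.result():
import asyncio
import threading
import time

loop = asyncio.new_event_loop()

def spin_loop():
    asyncio.set_event_loop(loop)
    loop.run_forever()

threading.Thread(target=spin_loop, daemon=True).start()

async def work(x):
    await asyncio.sleep(0.1)          # stands in for awaiting a subprocess, etc.
    return x * 2

def on_done(fut):
    # called once the coroutine has finished
    print("result:", fut.result())

fut = asyncio.run_coroutine_threadsafe(work(21), loop)
fut.add_done_callback(on_done)

time.sleep(0.5)                       # keep the demo alive long enough to see the callback
loop.call_soon_threadsafe(loop.stop)  # shut the loop down cleanly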
I suspect that what you are doing is simply unsupported - according to the documentation:
To handle signals and to execute subprocesses, the event loop must be run in the main thread.
As you are trying to execute a subprocess, I do not think running a new event loop in another thread works.
Thing is, Qt already has an event loop, and what you really need is to convince asyncio to use it. That means that you need an event loop implementation that provides the "event loop interface for asyncio" implemented on top of "Qt's event loop".
I believe that asyncqt provides such an implementation. You may want to try to use QEventLoop(app) in place of asyncio.new_event_loop().
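For reference, a minimal sketch of that suggestion, assuming the asyncqt package is installed and following its documented usage (the coroutine body is just a placeholder):
import sys
import asyncio
from PyQt5.QtWidgets import QApplication
from asyncqt import QEventLoop

app = QApplication(sys.argv)
loop = QEventLoop(app)        # an asyncio event loop driven by Qt's event loop
asyncio.set_event_loop(loop)

async def main():
    # coroutines now run inside Qt's main-thread event loop
    await asyncio.sleep(1)
    print("done")

with loop:
    loop.run_until_complete(main())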

Python multiprocessing script partial output

I am following the principles laid down in this post to safely output results which will eventually be written to a file. Unfortunately, the code only prints 1 and 2, and not 3 to 6.
import os
import argparse
import pandas as pd
import multiprocessing
from multiprocessing import Process, Queue
from time import sleep

def feed(queue, parlist):
    for par in parlist:
        queue.put(par)
    print("Queue size", queue.qsize())

def calc(queueIn, queueOut):
    while True:
        try:
            par = queueIn.get(block=False)
            res = doCalculation(par)
            queueOut.put((res))
            queueIn.task_done()
        except:
            break

def doCalculation(par):
    return par

def write(queue):
    while True:
        try:
            par = queue.get(block=False)
            print("response:", par)
        except:
            break

if __name__ == "__main__":
    nthreads = 2
    workerQueue = Queue()
    writerQueue = Queue()
    considerperiod = [1, 2, 3, 4, 5, 6]
    feedProc = Process(target=feed, args=(workerQueue, considerperiod))
    calcProc = [Process(target=calc, args=(workerQueue, writerQueue)) for i in range(nthreads)]
    writProc = Process(target=write, args=(writerQueue,))
    feedProc.start()
    feedProc.join()
    for p in calcProc:
        p.start()
    for p in calcProc:
        p.join()
    writProc.start()
    writProc.join()
On running the code it prints,
$ python3 tst.py
Queue size 6
response: 1
response: 2
Also, is it possible to ensure that the write function always outputs 1,2,3,4,5,6 i.e. in the same order in which the data is fed into the feed queue?
The error comes from the task_done() call: a plain multiprocessing.Queue has no task_done() method (only multiprocessing.JoinableQueue does), so the call raises an AttributeError, which the bare except silently turns into a break. If you remove that call, the code works, but then it only terminates because the queueIn.get(block=False) call throws an exception once the queue is empty. That might be just enough for your use case; a better way, though, is to use sentinels (as suggested in the multiprocessing docs, see the last example). Here's a little rewrite so your program uses sentinels:
import os
import argparse
import multiprocessing
from multiprocessing import Process, Queue
from time import sleep

def feed(queue, parlist, nthreads):
    for par in parlist:
        queue.put(par)
    for i in range(nthreads):
        queue.put(None)
    print("Queue size", queue.qsize())

def calc(queueIn, queueOut):
    while True:
        par = queueIn.get()
        if par is None:
            break
        res = doCalculation(par)
        queueOut.put((res))

def doCalculation(par):
    return par

def write(queue):
    while not queue.empty():
        par = queue.get()
        print("response:", par)

if __name__ == "__main__":
    nthreads = 2
    workerQueue = Queue()
    writerQueue = Queue()
    considerperiod = [1, 2, 3, 4, 5, 6]
    feedProc = Process(target=feed, args=(workerQueue, considerperiod, nthreads))
    calcProc = [Process(target=calc, args=(workerQueue, writerQueue)) for i in range(nthreads)]
    writProc = Process(target=write, args=(writerQueue,))
    feedProc.start()
    feedProc.join()
    for p in calcProc:
        p.start()
    for p in calcProc:
        p.join()
    writProc.start()
    writProc.join()
A few things to note:
the sentinel is a None put into the queue; note that you need one sentinel for every worker process.
the write function doesn't need sentinel handling, because it is a single process and there is no concurrency to deal with (if you used the empty()-then-get() pattern in your calc function, you would run into a problem when, say, only one item is left in the queue, both workers see empty() as False at the same time, both call get(), and one of them blocks forever).
you don't need to put feed and write into their own processes; just call them from your main function, since you don't want to run them in parallel anyway.
how can I have the same order in output as in input? [...] I guess multiprocessing.map can do this
Yes, map keeps the order. Here is your program rewritten into something simpler (you don't need the workerQueue and writerQueue), with random sleeps added to prove that the output is still in order:
from multiprocessing import Pool
import time
import random

def calc(val):
    time.sleep(random.random())
    return val

if __name__ == "__main__":
    considerperiod = [1, 2, 3, 4, 5, 6]
    with Pool(processes=2) as pool:
        print(pool.map(calc, considerperiod))
