What is mean when it raises a PicklingError? - python-3.x

Hy there!
I'm new on python 3.
I'm using the pvmomi module to get a dict of vm's from my server. When i try to run my file, with multiprocessing, i get the following Error:
_pickle.PicklingError: Can't pickle : attribute lookup vim.VirtualMachine on pyVmomi.VmomiSupport failed
What does this mean?
Here is a part of my code:
def login(vm):
#do something
if __name__=='__main__':
cpu = mp.cpu_count()
workers = mp.Pool(cpu)
workers.map(login,range(1))
for vm in vmDict:
login(vm)
My biggest problem comes from the for loop. I need this loop to do the jobs for every dictitem but only one pool worker do the job. Now i have configured my code to this below and it raises the PicklingError.
Thanks for help. It drives me crazy!

The stdlib pickle (.py) module imports the builtin C-coded _pickle module. The pickle module can serialize most Python objects and is used to transport Python objects between processes. In particular, pickle is used by multiprocessing (and perhaps by pyvmomi). User-defined classes sometimes define special methods (reduce and reducex, I believe) to help the pickle and unpickle processes.
The exception message says that an attribute lookup failed. Perhaps the pyVmomi object is not properly configured to be pickled. You might check to module doc to see if it says anything about pickle support.

Related

Python multiprocessing TypeError: cannot pickle '_queue.SimpleQueue' object

all. I currently meet a question. What I want to do is run a socket server by multiprocessing package. But once I run the program, there is an error will be reported, saying, TypeError: cannot pickle '_queue.SimpleQueue' object. I have no idea why this will happen. So, could anyone help me resolve it?
class xxx:
......
def test_connect(self, max_num=10, alive=True, mode='TCP', IP='127.0.0.1', PORT=8080):
p = multiprocessing.Process(
target=self.create_tcp_server, )
p.start()
p.join()
......
If not, I have another question with multi-thread question,
As the picture shows, I want to run comac_connect via thread pool and in the comac_connect it will call another function message_handle through the same thread pool. But the real situation is the function message_handle cannot be executed, and the program holds there. Is there any solution that I can resolve it?

Pytorch data pipeline

I am trying to implement a bounded buffer like solution where data generator and the model work as two separate processes. The data generator preprocess the data and stores in a shared queue (with predefined max size to limit the memory usage). The model on the other hand consumes data from this queue at its own pace until the queue is empty. Below is the snippet of my implementation.
'''
self._buffer is an object of multiprocessing.Queue
'''
def produce(self):
for obj in self._generator:
self._buffer.put(obj=obj, block=True, timeout=None)
self._buffer.put(obj=None)
def consume(self):
while True:
dat = self._buffer.get(block=True, timeout=None)
if dat is None:
break
else:
# Train model on `dat`
def run(self):
pt = multiprocessing.Process(target=self.produce)
ct = multiprocessing.Process(target=self.consume)
pt.start()
ct.start()
pt.join()
ct.join()
However, the solution above does not work. I used the torch.multiprocessing as instructed the documentation. I also set torch.multiprocessing.set_start_method('spawn') in order to avoid "RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method"
But now I get "TypeError: cannot pickle 'generator' object". How this can be fixed?
Since you work with pytorch you should use the Dataset and Dataloader approach. This handles all problems with multiprocessing, shared memory and so on for you.
You can have map style datasets or things like iterable-style.Best to read the official documentation, what is what and how they work.
In your case you probably are fine with an iterable-style dataset. I used both approaches for similar cases. You can have the iterable style dataset, which you might need if you don't know how much samples you will be processing. For other cases I had a map-style dataset, where I knew the total number of my samples beforehand (e.g. processing all images in a directory) and could use a sequential sampler to give me the elements in order.
Regarding one of your problems. All errors like this TypeError: cannot pickle 'generator' object happen when you have objects which can't be serialized. For serialization pickle is used. In your case self._generator seems to be an object which can't be serialized for some reason. Without code it is not possible to say why. I had cases where used wrapped c++ packages created with pybind where objects were not serializable or I had some mutex variables somewhere.

How to deal with python3 multiprocessing in __main__.py

Il posed question, I did not understand the ture cause of the issue (it seems to have been related to my usage of flask in one of the subprocesses).
PLEASE IGNORE THIS (can't delete due to bounty)
Essentially, I have to start some Processes and or a pool when running a python library as a module.
However, since __name__ == '__main__' is always true in __main__.py this proves to be an issue (see multiprocessing docs: https://docs.python.org/3/library/multiprocessing.html)
I've attempted multiple solutions ranging from: pytgquabr.com:8182/58288945/using-multiprocessing-with-runpy to a file-based mutext to only allow the contents of main to run once but multiprocessing still behaves strangely (e.g. Processes die almost as soon as they start with no error logs).
Any idea of what the "proper" way of going about this is ?
Guarding the __main__ module is only needed if an object defined inside __main__ is used in another process. Looking up this definition is what causes the execution of __main__ in the subprocess.
When using a __main__.py, restrict all definitions used with multiprocessing to other modules. __main__.py should only import and use these.
# my_package/some_module.py
def module_print(*args, **kwargs):
"""Function defined in some module - fine for use inside multiprocess"""
print(*args, **kwargs)
# my_package/__main__.py
import multiprocessing # imports are allowed
from .some_module import module_print
def do_multiprocess():
"""Function defined in __main__ module - fine for use wrapping multiprocess"""
with multiprocessing.Pool(processes=12) as pool:
pool.map(module_print, range(20)) # multiprocessing external function is allowed
do_multiprocess() # directly calling __main__ function is allowed

How is it possible to execute python code during deserialization?

I was reading about pickling in the context of persisting instances, and ran across this snippet:
Pickle files can be hacked. If you receive a raw pickle file over the network, don't trust it! It could have malicious code in it, that would run arbitrary python when you try to de-pickle it. [1]
My understanding is that pickling turns a data-structure into an array of bytes, and the pickle library also contains methods to take a pickled byte array and rebuild a python instance from it.
I tested some code to see if simply putting code into the class or init method would run it:
import pickle
class A:
print('class')
def __init__(self):
print('instance')
a = A()
print('pickling...')
with open('/home/usrname/Desktop/pfile', 'wb') as pfile:
pickle.dump(a, pfile, pickle.HIGHEST_PROTOCOL)
print('de-pickling...')
with open('/home/usrname/Desktop/pfile', 'rb') as pfile:
a2 = pickle.load(pfile)
However this only yields
class
instance
pickling...
de-pickling...
suggesting that the __ init__ method doesn't actually get run when the instance is unpickled. So I'm still confused how you would make code run during that process.
Really thorough writeup here: https://intoli.com/blog/dangerous-pickles/
From what I understand, it has to do with how pickles are interpreted by the Pickle Machine (PM) and run. You can craft a pickle file that will cause it to evaluate using eval() the statements provided.

Multiprocessing hangs when applying function to pandas dataframe Python 3.7.1

I am trying to parallelize a function on my pandas dataframe and I'm running into an issue where it seems that the multiprocessing library is hanging. I am doing this all within a Jupyter notebook with myFunction() existing in a separate .py file. Can someone point out what I am doing wrong here?
Surprisingly, this piece of code has worked previously on my Windows 7 machine with the same version of python. I have just copied the file over to my Mac laptop.
I also use tqdm so I can monitor the progress, the behavior is the same with or without it.
#This function hands the multiprocessing
from multiprocessing import Pool, cpu_count
import numpy as np
import tqdm
def parallelize_dataframe(df, func):
num_partitions = cpu_count()*2 # number of partitions to split dataframe
num_cores = cpu_count() # number of cores on your machine
df_split = np.array_split(df, num_partitions)
pool = Pool(num_cores)
return pd.concat(list(tqdm.tqdm_notebook(pool.imap(func, df_split),total=num_partitions)))
#My function that I am applying to the dataframe is in another file
#myFunction retrieves a JSON from an API for each ID in myDF and converts it to a dataframe
from myFuctions import myFunction
#Code that calls the parallelize function
finalDF = parallelize_dataframe(myDF,myFunction)
The expected result is a concatenation of a list of dataframes that have been retrieved by myFunction(). This is worked in the past, but now the process seems to hang indefinitely without any error messages.
Q : Can someone point out what I am doing wrong here?
You just expected the MacOS to use the same mechanism for process-instantiations as the WinOS did in past.
The multiprocessing module does not do the same set of things on either of the supported O/S-es and even reported some methods to be dangerous and also had changed the default behaviour on MacOS- and Linux-based systems.
Next steps to try to move forward :
re-read how to do the explicit setup of the call-signatures in multiprocessing documentation ( avoid hidden dependency of the code-behaviour on "new" default values )
test if may avoid the cases where multiprocessing will spawn the full-copy of the python-interpreter process, that many times as you instruct ( memory allocations could soon get devastatingly large, if many replicas try to get instantiated beyond the localhost RAM-footprint, just due to a growing number of CPU-cores )
test if the "worker"-code is not computing intensive but rather network-remote API-call latency driven. In such a case asyncio/await decorated tools will help more with latency-masking than going into in the case of IO-latency dominated use-cases inefficient multiprocessing spawned and rather expensive full-copy concurrency of many python-processes (that just stay waitin for receiving remote-API answers ).
last but not least - performance-sensitive code best runs outside any mediating-ecosystem, like the interactivity-focused Jupyter-notebooks are.

Resources