How is it possible to execute python code during deserialization? - python-3.x

I was reading about pickling in the context of persisting instances, and ran across this snippet:
Pickle files can be hacked. If you receive a raw pickle file over the network, don't trust it! It could have malicious code in it, that would run arbitrary python when you try to de-pickle it. [1]
My understanding is that pickling turns a data structure into an array of bytes, and the pickle library also contains methods to take a pickled byte array and rebuild a Python instance from it.
I tested some code to see whether simply putting code into the class body or the __init__ method would run it:
import pickle

class A:
    print('class')

    def __init__(self):
        print('instance')

a = A()

print('pickling...')
with open('/home/usrname/Desktop/pfile', 'wb') as pfile:
    pickle.dump(a, pfile, pickle.HIGHEST_PROTOCOL)

print('de-pickling...')
with open('/home/usrname/Desktop/pfile', 'rb') as pfile:
    a2 = pickle.load(pfile)
However, this only yields
class
instance
pickling...
de-pickling...
suggesting that the __init__ method doesn't actually get run when the instance is unpickled. So I'm still confused about how you would make code run during that process.

Really thorough writeup here: https://intoli.com/blog/dangerous-pickles/
From what I understand, it has to do with how pickles are interpreted by the Pickle Machine (PM). You can craft a pickle stream whose opcodes instruct the PM to call a function such as eval() on attacker-supplied arguments, so arbitrary code runs the moment the data is loaded.
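For example, here is a minimal sketch of that idea using __reduce__, which tells the unpickler which callable to invoke when rebuilding the object (print stands in for something nastier like os.system):

import pickle

class Evil:
    def __reduce__(self):
        # The unpickler calls print(...) to "reconstruct" this object,
        # so whatever callable we return here runs at load time.
        return (print, ("arbitrary code ran during unpickling!",))

payload = pickle.dumps(Evil())
pickle.loads(payload)  # prints the message; Evil.__init__ never runs

Note that nothing in Evil's class body or __init__ executes during loading; the callable stored in the pickle stream does.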

Related

Pytorch data pipeline

I am trying to implement a bounded-buffer-like solution where the data generator and the model work as two separate processes. The data generator preprocesses the data and stores it in a shared queue (with a predefined max size to limit memory usage). The model, on the other hand, consumes data from this queue at its own pace until the queue is empty. Below is a snippet of my implementation.
# self._buffer is a multiprocessing.Queue

def produce(self):
    for obj in self._generator:
        self._buffer.put(obj, block=True, timeout=None)
    self._buffer.put(None)  # sentinel telling the consumer to stop

def consume(self):
    while True:
        dat = self._buffer.get(block=True, timeout=None)
        if dat is None:
            break
        # Train model on `dat`

def run(self):
    pt = multiprocessing.Process(target=self.produce)
    ct = multiprocessing.Process(target=self.consume)
    pt.start()
    ct.start()
    pt.join()
    ct.join()
However, the solution above does not work. I used torch.multiprocessing as instructed by the documentation. I also set torch.multiprocessing.set_start_method('spawn') in order to avoid "RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method".
But now I get "TypeError: cannot pickle 'generator' object". How can this be fixed?
Since you work with PyTorch, you should use the Dataset and DataLoader approach. It handles all the problems with multiprocessing, shared memory, and so on for you.
You can have map-style datasets or iterable-style ones. It is best to read the official documentation on what is what and how they work.
In your case you are probably fine with an iterable-style dataset. I have used both approaches for similar cases: an iterable-style dataset when you don't know in advance how many samples you will be processing, and a map-style dataset when I knew the total number of samples beforehand (e.g. processing all images in a directory) and could use a sequential sampler to give me the elements in order.
Regarding one of your problems: errors like TypeError: cannot pickle 'generator' object happen when you have objects which can't be serialized, and pickle is used for that serialization. In your case self._generator seems to be an object which can't be serialized for some reason. Without the code it is not possible to say why. I have had cases where wrapped C++ packages created with pybind were not serializable, or where I had some mutex variables somewhere.
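As a rough sketch of the iterable-style idea (GeneratorDataset and make_samples are illustrative names, assuming your generator can be re-created inside each worker):

import torch
from torch.utils.data import IterableDataset, DataLoader

class GeneratorDataset(IterableDataset):
    def __init__(self, generator_fn):
        # Store a callable that builds a fresh generator, not a live generator,
        # so nothing unpicklable has to cross the process boundary.
        self.generator_fn = generator_fn

    def __iter__(self):
        # Called inside each DataLoader worker. With num_workers > 1 every
        # worker iterates its own copy, so shard the stream with
        # torch.utils.data.get_worker_info() if duplicates matter.
        return iter(self.generator_fn())

def make_samples():
    for i in range(100):
        yield torch.tensor([float(i)])

if __name__ == '__main__':
    loader = DataLoader(GeneratorDataset(make_samples), batch_size=8, num_workers=2)
    for batch in loader:
        pass  # train the model on `batch` here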

How to create an async multiprocessing JobQueue in Python?

I'm trying to make a Python 'JobQueue' that performs computationally intensive tasks asynchronously, on a local machine, with a mechanism that returns the results of each task to the main process. Python's multiprocessing.Pool has an apply_async() function that meets those requirements by accepting an arbitrary function, its multiple arguments, and callback functions that return the results. For example...
import multiprocessing

pool = multiprocessing.Pool(poolsize)
pool.apply_async(func, args=args,
                 callback=mycallback,
                 error_callback=myerror_callback)
The only problem is that the function given to apply_async() must be serializable with pickle, and the functions I need to run concurrently are not. FYI, the reason is that the target function is a member of an object that contains an IDL object, for example:
from idlpy import IDL
self.idl_obj = IDL.obj_new('ImageProcessingEngine')
This is the error message received at the pool.apply_async() line:
'Can't pickle local object 'IDL.__init__.<locals>.run''
What I tried
I made a simple implementation of a JobQueue that works perfectly fine in Python 3.6+, provided the Job object and its run() method are picklable. I like how the main process can receive an arbitrarily complex amount of data returned from the asynchronously executed function via a callback function.
I tried to use pathos.pools.ProcessPool since it uses dill instead of pickle. However, it doesn't have a method similar to apply_async().
Are there any other options, or 3rd party libraries that provide this functionality using dill, or by some other means?
How about creating a stub function that instantiates the IDL endpoint as a function-static variable?
Please note that this is only a sketch of the code, as it is hard to tell from the question whether you are passing IDL objects as parameters to the function you run in parallel or whether they serve another purpose.
def stub_fun(paramset):
    if 'idl_obj' not in dir(stub_fun):  # instantiate once per process
        stub_fun.idl_obj = IDL.obj_new('ImageProcessingEngine')
    return stub_fun.idl_obj(paramset)
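For what it's worth, a rough sketch of how such a stub could be handed to the pool from the question (paramsets is a placeholder for your picklable work items; mycallback and myerror_callback are the callbacks from your snippet):

import multiprocessing

if __name__ == '__main__':
    with multiprocessing.Pool(4) as pool:
        # Only the picklable paramset crosses the process boundary;
        # the IDL object is created lazily inside each worker by stub_fun.
        results = [pool.apply_async(stub_fun, args=(paramset,),
                                    callback=mycallback,
                                    error_callback=myerror_callback)
                   for paramset in paramsets]
        pool.close()
        pool.join()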

Using 'with' with 'next()' in python 3 [duplicate]

I came across the Python with statement for the first time today. I've been using Python lightly for several months and didn't even know of its existence! Given its somewhat obscure status, I thought it would be worth asking:
1. What is the Python with statement designed to be used for?
2. What do you use it for?
3. Are there any gotchas I need to be aware of, or common anti-patterns associated with its use? Any cases where it is better to use try..finally than with?
4. Why isn't it used more widely?
5. Which standard library classes are compatible with it?
I believe this has already been answered by other users before me, so I only add it for the sake of completeness: the with statement simplifies exception handling by encapsulating common preparation and cleanup tasks in so-called context managers. More details can be found in PEP 343. For instance, the file object returned by open() is itself a context manager: it lets you open a file, keeps it open as long as execution is inside the with block where you used it, and closes it as soon as you leave that block, no matter whether you leave because of an exception or during regular control flow. The with statement can thus be used in ways similar to the RAII pattern in C++: some resource is acquired by the with statement and released when you leave the with context.
Some examples are: opening files using with open(filename) as fp:, acquiring locks using with lock: (where lock is an instance of threading.Lock). You can also construct your own context managers using the contextmanager decorator from contextlib. For instance, I often use this when I have to change the current directory temporarily and then return to where I was:
from contextlib import contextmanager
import os

@contextmanager
def working_directory(path):
    current_dir = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        os.chdir(current_dir)

with working_directory("data/stuff"):
    pass  # do something within data/stuff
# here I am back again in the original working directory
Here's another example that temporarily redirects sys.stdin, sys.stdout and sys.stderr to some other file handle and restores them later:
from contextlib import contextmanager
import sys

@contextmanager
def redirected(**kwds):
    stream_names = ["stdin", "stdout", "stderr"]
    old_streams = {}
    try:
        for sname in stream_names:
            stream = kwds.get(sname, None)
            if stream is not None and stream != getattr(sys, sname):
                old_streams[sname] = getattr(sys, sname)
                setattr(sys, sname, stream)
        yield
    finally:
        for sname, stream in old_streams.items():
            setattr(sys, sname, stream)

with redirected(stdout=open("/tmp/log.txt", "w")):
    # these print calls will go to /tmp/log.txt
    print("Test entry 1")
    print("Test entry 2")

# back to the normal stdout
print("Back to normal stdout again")
And finally, another example that creates a temporary folder and cleans it up when leaving the context:
from contextlib import contextmanager
from tempfile import mkdtemp
from shutil import rmtree

@contextmanager
def temporary_dir(*args, **kwds):
    name = mkdtemp(*args, **kwds)
    try:
        yield name
    finally:
        rmtree(name)

with temporary_dir() as dirname:
    pass  # do whatever you want
I would suggest two interesting reads:
PEP 343: The "with" Statement
effbot: Understanding Python's "with" statement
1.
The with statement is used to wrap the execution of a block with methods defined by a context manager. This allows common try...except...finally usage patterns to be encapsulated for convenient reuse.
2.
You could do something like:
with open("foo.txt") as foo_file:
    data = foo_file.read()
OR (Python 2 only; contextlib.nested was removed in Python 3)
from contextlib import nested
with nested(A(), B(), C()) as (X, Y, Z):
    do_something()
OR (Python 3.1)
with open('data') as input_file, open('result', 'w') as output_file:
    for line in input_file:
        output_file.write(parse(line))
OR
lock = threading.Lock()
with lock:
    pass  # critical section of code
3.
I don't see any antipattern here.
Quoting Dive into Python:
try..finally is good. with is better.
4.
I guess it's related to programmers' habit of using try..catch..finally statements from other languages.
The Python with statement is built-in language support of the Resource Acquisition Is Initialization idiom commonly used in C++. It is intended to allow safe acquisition and release of operating system resources.
The with statement creates resources within a scope/block. You write your code using the resources within the block. When the block exits the resources are cleanly released regardless of the outcome of the code in the block (that is whether the block exits normally or because of an exception).
Many resources in the Python library obey the protocol required by the with statement and so can be used with it out-of-the-box. However, anyone can make resources that can be used in a with statement by implementing the well-documented protocol: PEP 343.
Use it whenever you acquire resources in your application that must be explicitly relinquished such as files, network connections, locks and the like.
Again for completeness I'll add my most useful use-case for with statements.
I do a lot of scientific computing, and for some activities I need the decimal library for arbitrary-precision calculations. In some parts of my code I need high precision, and for most other parts I need less precision.
I set my default precision to a low number and then use with to get a more precise answer for some sections:
from decimal import localcontext

with localcontext() as ctx:
    ctx.prec = 42  # perform a high-precision calculation
    s = calculate_something()
s = +s  # round the final result back to the default precision
I use this a lot with the hypergeometric test, which requires the division of large numbers resulting from factorials. When you do genomic-scale calculations you have to be careful of round-off and overflow errors.
An example of an antipattern might be to use with inside a loop when it would be more efficient to have the with outside the loop. For example:
for row in lines:
    with open("outfile", "a") as f:
        f.write(row)
vs
with open("outfile", "a") as f:
    for row in lines:
        f.write(row)
The first way opens and closes the file for each row, which may cause performance problems compared to the second way, which opens and closes the file just once.
See PEP 343 - The 'with' statement, there is an example section at the end.
... new statement "with" to the Python language to make it possible to factor out standard uses of try/finally statements.
Points 1, 2, and 3 being reasonably well covered:
4: it is relatively new, only available in Python 2.6+ (or Python 2.5 using from __future__ import with_statement).
The with statement works with so-called context managers:
http://docs.python.org/release/2.5.2/lib/typecontextmanager.html
The idea is to simplify exception handling by doing the necessary cleanup after leaving the 'with' block. Some of the python built-ins already work as context managers.
Another example of out-of-the-box support, and one that might be a bit baffling at first when you are used to the way the built-in open() behaves, are the connection objects of popular database modules such as:
sqlite3
psycopg2
cx_oracle
The connection objects are context managers and as such can be used out-of-the-box in a with statement. However, when using the above, note that:
When the with block is finished, either with an exception or without, the connection is not closed. If the with block finishes with an exception, the transaction is rolled back; otherwise the transaction is committed.
This means that the programmer has to take care to close the connection themselves, but it allows them to acquire a connection and use it in multiple with statements, as shown in the psycopg2 docs:
conn = psycopg2.connect(DSN)

with conn:
    with conn.cursor() as curs:
        curs.execute(SQL1)

with conn:
    with conn.cursor() as curs:
        curs.execute(SQL2)

conn.close()
In the example above, you'll note that the cursor objects of psycopg2 also are context managers. From the relevant documentation on the behavior:
When a cursor exits the with-block it is closed, releasing any resource eventually associated with it. The state of the transaction is not affected.
In Python, the with statement is generally used to open a file, process the data in it, and then close the file without calling its close() method. The with statement makes exception handling simpler by providing cleanup activities.
General form of with:
with open("file name", "mode") as file_var:
    ...  # processing statements
Note: there is no need to close the file by calling file_var.close().
The answers here are great, but just to add a simple one that helped me:
with open("foo.txt") as file:
data = file.read()
open returns a file
Since 2.6 python added the methods __enter__ and __exit__ to file.
with is like a for loop that calls __enter__, runs the loop once and then calls __exit__
with works with any instance that has __enter__ and __exit__
a file is locked and not re-usable by other processes until it's closed, __exit__ closes it.
source: http://web.archive.org/web/20180310054708/http://effbot.org/zone/python-with-statement.htm
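To illustrate that last point, here is a tiny sketch of a homemade class that works in a with statement simply because it defines __enter__ and __exit__ (the Timer class is made up for the example):

import time

class Timer:
    def __enter__(self):
        self.start = time.perf_counter()
        return self  # bound to the name after `as`

    def __exit__(self, exc_type, exc_value, traceback):
        self.elapsed = time.perf_counter() - self.start
        return False  # False means exceptions are not suppressed

with Timer() as t:
    sum(range(1_000_000))
print(t.elapsed)  # how long the block took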

What does it mean when it raises a PicklingError?

Hi there!
I'm new to Python 3.
I'm using the pyVmomi module to get a dict of VMs from my server. When I try to run my file with multiprocessing, I get the following error:
_pickle.PicklingError: Can't pickle : attribute lookup vim.VirtualMachine on pyVmomi.VmomiSupport failed
What does this mean?
Here is a part of my code:
import multiprocessing as mp

def login(vm):
    pass  # do something

if __name__ == '__main__':
    cpu = mp.cpu_count()
    workers = mp.Pool(cpu)
    workers.map(login, range(1))
    for vm in vmDict:
        login(vm)
My biggest problem comes from the for loop. I need this loop to do the job for every dict item, but only one pool worker does the job. Now I have configured my code as shown above, and it raises the PicklingError.
Thanks for the help. It's driving me crazy!
The stdlib pickle (.py) module imports the builtin C-coded _pickle module. The pickle module can serialize most Python objects and is used to transport Python objects between processes. In particular, pickle is used by multiprocessing (and perhaps by pyVmomi). User-defined classes sometimes define the special methods __reduce__ and __reduce_ex__ to help the pickling and unpickling process.
The exception message says that an attribute lookup failed. Perhaps the pyVmomi object is not properly set up to be pickled. You might check the module docs to see if they say anything about pickle support.
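As a rough illustration of how those hooks work, a class can make itself picklable by describing how to rebuild itself from picklable pieces; here Wrapper and its lock attribute are just stand-ins for an object holding an unpicklable handle (such as a live vim.VirtualMachine):

import pickle
import threading

class Wrapper:
    def __init__(self, name):
        self.name = name
        self.handle = threading.Lock()  # stand-in for an unpicklable resource

    def __reduce__(self):
        # Rebuild the object from its constructor argument instead of trying
        # to serialize the live handle, which pickle cannot do.
        return (Wrapper, (self.name,))

copy = pickle.loads(pickle.dumps(Wrapper("vm-42")))
print(copy.name)  # "vm-42"; the handle was re-created, not serialized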

Cannot access hdf5 array from a thread

I am using h5py and processing the array in several threads:
def process(start, end, dataset):
    for i in xrange(start, end):
        pass  # Do something with dataset[i]

f = h5py.File(path, 'r')
dataset = f[...]
worker = [threading.Thread(target=process, args=(start, end, dataset))
          for start, end in ...]
I get an error when accessing the array from a thread:
File "/usr/lib/python2.7/dist-packages/h5py/_hl/dataset.py", line 367, in __getitem__
if self._local.astype is not None:
AttributeError: 'thread._local' object has no attribute 'astype'
I really have no clue why this happens. I can access dtype and shape, but on access to any slice I get this error.
Edit:
Calling
process(0, len(dataset), dataset)
in the main thread works as expected.
Can you provide a full example that reproduces the error? Please also mention the versions of Python, h5py, and the hdf5 library that you're using, as well as your OS.
The following code works with my setup, does it work on yours?
import threading
import h5py
import numpy as np

db = h5py.File('/tmp/test.h5', 'w')
dataset = db.create_dataset("mydata", data=np.random.random((10,)))

def process(start, end, dataset):
    for i in xrange(start, end):
        print(dataset[i])

workers = [threading.Thread(target=process, args=(start, end, dataset))
           for start, end in [[1, 2], [3, 4]]]
workers[0].start()
workers[1].start()
db.close()
May I ask why you are using threads to process your HDF5 file? Note that HDF5 does not provide thread-level concurrency. Although the above example "works" (provided HDF5 was compiled with the "threadsafe" option), the two operations will run sequentially. HDF5 operations are blocking and do not release the global interpreter lock during I/O, which prevents threads from running in parallel.
If you want your code to be executed in parallel, you have to use processes instead of threads. Note, however, that parallel reading and writing is safe only with the MPI version of h5py/HDF5.
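If you do switch to processes, a common pattern is to let each worker open the file on its own instead of sharing an open Dataset object across processes. A minimal sketch reusing the file from the example above (safe for concurrent reads as long as nothing is writing):

import multiprocessing
import h5py

def worker(path, start, end):
    # Each process opens the file read-only for itself.
    with h5py.File(path, 'r') as f:
        dataset = f['mydata']
        for i in range(start, end):
            print(dataset[i])

if __name__ == '__main__':
    jobs = [multiprocessing.Process(target=worker, args=('/tmp/test.h5', s, e))
            for s, e in [(1, 2), (3, 4)]]
    for p in jobs:
        p.start()
    for p in jobs:
        p.join()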
I had the same problem.
It was solved by updating h5py (from 2.2 to 2.6).
