I have some floating-point numbers to store in a big (500K x 500K) matrix. I am storing them in chunks, using arrays of variable size (in accordance with some specific conditions).
I have parallelised code (Python 3.3 and h5py) that produces the arrays and puts them in a shared queue, and one dedicated process that pops from the queue and writes them one by one into the HDF5 matrix. It works as expected approximately 90% of the time.
Occasionally, I get write errors for specific arrays. If I run it multiple times, the faulty arrays change every time.
Here's the code:
def writer(in_q):
    # Open HDF5 archive
    hdf5_file = h5py.File("./google_matrix_test.hdf5")
    hdf5_scores = hdf5_file['scores']
    while True:
        # Get some data
        try:
            data = in_q.get(timeout=5)
        except:
            hdf5_file.flush()
            print('HDF5 archive updated.')
            break
        # Process the data
        try:
            hdf5_scores[data[0], data[1]:data[2]+1] = numpy.matrix(data[3:])
        except:
            # Print faulty chunk's info
            print('E: ' + str(data[0:3]))
            in_q.put(data)  # <- doesn't solve
        in_q.task_done()

def compute():
    jobs_queue = JoinableQueue()
    scores_queue = JoinableQueue()
    processes = []
    processes.append(Process(target=producer, args=(jobs_queue, data,)))
    processes.append(Process(target=writer, args=(scores_queue,)))
    for i in range(10):
        processes.append(Process(target=consumer, args=(jobs_queue, scores_queue,)))
    for p in processes:
        p.start()
    processes[1].join()
    scores_queue.join()
Here's the error:
Process Process-2:
Traceback (most recent call last):
File "/local/software/python3.3/lib/python3.3/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/local/software/python3.3/lib/python3.3/multiprocessing/process.py", line 95, in run
self._target(*self._args, **self._kwargs)
File "./compute_scores_multiprocess.py", line 104, in writer
hdf5_scores[data[0], data[1]:data[2]+1] = numpy.matrix(data[3:])
File "/local/software/python3.3/lib/python3.3/site-packages/h5py/_hl/dataset.py", line 551, in __setitem__
self.id.write(mspace, fspace, val, mtype)
File "h5d.pyx", line 217, in h5py.h5d.DatasetID.write (h5py/h5d.c:2925)
File "_proxy.pyx", line 120, in h5py._proxy.dset_rw (h5py/_proxy.c:1491)
File "_proxy.pyx", line 93, in h5py._proxy.H5PY_H5Dwrite (h5py/_proxy.c:1301)
OSError: can't write data (Dataset: Write failed)
If I insert a pause of two seconds (time.sleep(2)) between write tasks, the problem seems to be solved (although I cannot waste two seconds per write, since I need to write more than 250,000 times). If I catch the write exception and put the faulty array back in the queue, the script never stops (presumably).
I am using CentOS (2.6.32-279.11.1.el6.x86_64). Any insight?
Thanks a lot.
When using the multiprocessing module with HDF5, the only big restriction is that you can't have any files open (even read-only) when fork() is called. In other words, if you open a file in the master process to write, and then Python spins off a subprocess for computation, there may be problems. It has to do with how fork() works and the choices HDF5 itself makes about how to handle file descriptors.
My advice is to double-check your application to make sure you're creating any Pools, etc. before opening the master file for writing.
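For illustration, here is a minimal sketch of that ordering, simplified from the code in the question (the None sentinel used for shutdown is my own addition): every child process is started before any HDF5 file is opened anywhere, and the file is opened only inside the single writer process.

import h5py
from multiprocessing import Process, JoinableQueue

def writer(in_q):
    # Open the file only inside the writer process, after the fork,
    # so no HDF5 handle ever crosses the fork() boundary.
    with h5py.File("./google_matrix_test.hdf5", "a") as hdf5_file:
        scores = hdf5_file['scores']
        while True:
            data = in_q.get()
            if data is None:  # sentinel: no more chunks to write
                in_q.task_done()
                break
            scores[data[0], data[1]:data[2]+1] = data[3:]
            in_q.task_done()

if __name__ == '__main__':
    scores_queue = JoinableQueue()
    # Start the writer (and any other workers) before opening any HDF5 file.
    w = Process(target=writer, args=(scores_queue,))
    w.start()
    # ... enqueue chunks here, then signal completion ...
    scores_queue.put(None)
    scores_queue.join()
    w.join()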
File "/usr/lib/python3.6/multiprocessing/pool.py", line 266, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/usr/lib/python3.6/multiprocessing/pool.py", line 644, in get
raise self._value
File "/usr/lib/python3.6/multiprocessing/pool.py", line 424, in _handle_tasks
put(task)
File "/usr/lib/python3.6/multiprocessing/connection.py", line 206, in send
self._send_bytes(_ForkingPickler.dumps(obj))
File "/usr/lib/python3.6/multiprocessing/connection.py", line 393, in _send_bytes
header = struct.pack("!i", n)
struct.error: 'i' format requires -2147483648 <= number <= 2147483647
I ran a Python 3.6 program that uses multiprocessing and got the error above.
As far as I know, it's a problem with the pickle size limit during multiprocessing.
I am also aware that pickle protocol 4 lets you pass bigger data.
The problem is that the multiprocessing happens inside a compiled .so module written in Cython, and I can't change the .so file since I don't have the source code for the .pyx.
It's also not possible to shrink the input before it reaches the .so module's function, because I don't know exactly what goes on inside that module, and the data grows large while its function is running.
The only way I can think of is to make the Python multiprocessing package use pickle protocol 4 by default.
If that's possible, could someone please tell me how?
Or is there another possible way to fix this?
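One commonly suggested workaround, which I have not been able to verify against a compiled module (the class names below are my own), is to install a custom reducer that pins the pickle protocol to 4 on the multiprocessing context before any Pool is created:

import multiprocessing
from multiprocessing.reduction import ForkingPickler, AbstractReducer

class ForkingPickler4(ForkingPickler):
    # Pin the pickle protocol to 4 instead of the default.
    @classmethod
    def dumps(cls, obj, protocol=4):
        return super().dumps(obj, protocol)

def dump4(obj, file, protocol=4):
    ForkingPickler4(file, protocol).dump(obj)

class Pickle4Reducer(AbstractReducer):
    ForkingPickler = ForkingPickler4
    register = ForkingPickler4.register
    dump = dump4

ctx = multiprocessing.get_context()
ctx.reducer = Pickle4Reducer()  # install before any Pool/Process is created

with ctx.Pool() as pool:
    ...  # run the work that eventually calls into the .so module

Whether the .so module's internal Pool actually picks this reducer up depends on how and when it imports multiprocessing, so treat this as an experiment rather than a guaranteed fix.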
I'm trying to train the research model ssd_mobilenet_v1_fpn_640x640_coco17_tpu-8 using MultiWorkerMirroredStrategy (by setting --num_workers=2 in the invocation of model_main_tf2.py), across two workers (0 and 1), each with a single GPU. However, when I attempt this I get the following error, always on worker 1:
Traceback (most recent call last):
File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\tensorflow\python\distribute\input_lib.py", line 553, in __next__
return self.get_next()
File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\tensorflow\python\distribute\input_lib.py", line 610, in get_next
return self._get_next_no_partial_batch_handling(name)
File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\tensorflow\python\distribute\input_lib.py", line 642, in _get_next_no_partial_batch_handling
replicas.extend(self._iterators[i].get_next_as_list(new_name))
File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\tensorflow\python\distribute\input_lib.py", line 1594, in get_next_as_list
return self._format_data_list_with_options(self._iterator.get_next())
File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\tensorflow\python\data\ops\multi_device_iterator_ops.py", line 580, in get_next
result.append(self._device_iterators[i].get_next())
File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\tensorflow\python\data\ops\iterator_ops.py", line 889, in get_next
return self._next_internal()
File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\tensorflow\python\data\ops\iterator_ops.py", line 819, in _next_internal
ret = gen_dataset_ops.iterator_get_next(
File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\tensorflow\python\ops\gen_dataset_ops.py", line 2922, in iterator_get_next
_ops.raise_from_not_ok_status(e, name)
File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\tensorflow\python\framework\ops.py", line 7186, in raise_from_not_ok_status
raise core._status_to_exception(e) from None # pylint: disable=protected-access
tensorflow.python.framework.errors_impl.OutOfRangeError: End of sequence [Op:IteratorGetNext]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\JS\Desktop\Tensorflow\models\research\object_detection\model_main_tf2.py", line 114, in <module>
tf.compat.v1.app.run()
File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\tensorflow\python\platform\app.py", line 36, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\absl\app.py", line 312, in run
_run_main(main, args)
File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\absl\app.py", line 258, in _run_main
sys.exit(main(argv))
File "C:\Users\JS\Desktop\Tensorflow\models\research\object_detection\model_main_tf2.py", line 105, in main
model_lib_v2.train_loop(
File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\object_detection\model_lib_v2.py", line 605, in train_loop
load_fine_tune_checkpoint(
File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\object_detection\model_lib_v2.py", line 401, in load_fine_tune_checkpoint
_ensure_model_is_built(model, input_dataset, unpad_groundtruth_tensors)
File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\object_detection\model_lib_v2.py", line 161, in _ensure_model_is_built
features, labels = iter(input_dataset).next()
File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\tensorflow\python\distribute\input_lib.py", line 549, in next
return self.__next__()
File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\tensorflow\python\distribute\input_lib.py", line 555, in __next__
raise StopIteration
StopIteration
Worker 0 eventually fails after detecting that worker 1 has gone down.
This error happens regardless of the physical machines on which the two workers run. In other words, I see it whether I run both workers on a single machine (using localhost) or on different machines on the same network.
Based on the trace in the error messages, the error appears to be occurring whenever the training loop attempts to iterate over the training data generated by strategy.experimental_distribute_datasets_from_function. Note that if I change the strategy to MirroredStrategy it runs fine on a single machine (no other changes made). I'm not sure if I'm doing something wrong or if there is a bug in the object detection API.
My setup on both machines is identical (I basically followed the setup instructions on the object detection website):
Windows 10
TensorFlow 2.8.0
CUDA Toolkit 11.2
cuDNN 8.1
Has anyone ever seen this error before? If so, is there a way around it?
Ok, I think I understand the issue. In the object detection library there is a file called dataset_builder.py that builds the training dataset from the TFRecord stored in the file specified in the pipeline.config file (in the input_path item of the tf_record_input_reader). The function that actually reads the TFRecord file is _read_dataset_internal. This function treats the input_path of the pipeline config as a LIST OF FILES and then applies a sharding function (passed as an argument) to divide the files between the replicas doing the training (one replica per worker). Since my input_path only specified a single TFRecord file it was assigned to the first replica and the other replicas were given empty filenames!! Thus only the first replica actually had an input dataset to work with, hence the crash.
The solution was to split the training data across two files (two TFRecords) and then set the input_path in pipeline.config to be a list of paths rather than a single path. Once I did this it appears as though the model trained successfully (at least it didn't crash).
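For reference, a minimal sketch of splitting a single TFRecord into two shards, one per worker replica (the file names are hypothetical); the resulting shard paths then go into input_path as a list:

import tensorflow as tf

# Split train.record into two shards so each replica gets at least one file.
records = tf.data.TFRecordDataset("train.record")
writers = [tf.io.TFRecordWriter("train-%05d-of-00002.record" % i) for i in range(2)]
for i, rec in enumerate(records):
    writers[i % 2].write(rec.numpy())
for w in writers:
    w.close()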
I'm not sure if this is a bug in the object detection code or not. I assumed that if I only had one training record (visible to both workers) that both workers would use it and just batch the data accordingly. I'm just not sure if the assumption itself is wrong or if the assumption is correct and the code is wrong.
Anyway, I hope this helps anyone who might be wrestling with the same issue.
The company I work for distributes document assembly software that uses the python-docx library. The software runs a function on every generated document that opens the document and does a simple search and replace for characters that weren't escaped properly (namely "&amp;" -> "&").
FYI The actual document assembly uses python-docx-template. However, the error happens after the document has already been assembled and the error is triggered by the search-and-replace function, which only uses python-docx.
Recently, we've had a few cases where documents are failing to generate on client deployments. They're throwing an error on this line where the document object is instantiated:
doc = Document(docx=Path(doc_path))
We've seen two errors:
raise BadZipFile("Bad magic number for file header")
and
raise EOFError
The software is widely used and we've never had this issue before. We can't reproduce it in our test environments. The error has only started appearing in the past week but has shown up for several clients after they were updated. The software will fail to generate a particular document some number of times but will succeed after a few tries.
We've only seen it happen with one document in particular, but all documents use the same search and replace function, and like I said the error is only intermittent with the problem document.
There have been no changes in code to this search and replace function and I can't think of any other meaningful difference to our doc assembly process that would explain this.
I'm having a lot of trouble finding info on what could cause this specifically with the python-docx library. Is this a sign that the generated document is corrupted? If anyone is able to shed some light on possible causes that would be very helpful!
Here's the stack trace for both errors:
Bad magic number...
File "/home/user/app/application/document_assembly/core_da.py", line 524, in translate_ampersands
doc = Document(docx=Path(doc_path))
File "/home/user/app-venv/lib/python3.6/site-packages/docx/api.py", line 25, in Document
document_part = Package.open(docx).main_document_part
File "/home/user/app-venv/lib/python3.6/site-packages/docx/opc/package.py", line 116, in open
pkg_reader = PackageReader.from_file(pkg_file)
File "/home/user/app-venv/lib/python3.6/site-packages/docx/opc/pkgreader.py", line 36, in from_file
phys_reader, pkg_srels, content_types
File "/home/user/app-venv/lib/python3.6/site-packages/docx/opc/pkgreader.py", line 69, in _load_serialized_parts
for partname, blob, reltype, srels in part_walker:
File "/home/user/app-venv/lib/python3.6/site-packages/docx/opc/pkgreader.py", line 104, in _walk_phys_parts
part_srels = PackageReader._srels_for(phys_reader, partname)
File "/home/user/app-venv/lib/python3.6/site-packages/docx/opc/pkgreader.py", line 83, in _srels_for
rels_xml = phys_reader.rels_xml_for(source_uri)
File "/home/user/app-venv/lib/python3.6/site-packages/docx/opc/phys_pkg.py", line 129, in rels_xml_for
rels_xml = self.blob_for(source_uri.rels_uri)
File "/home/user/app-venv/lib/python3.6/site-packages/docx/opc/phys_pkg.py", line 108, in blob_for
return self._zipf.read(pack_uri.membername)
File "/usr/lib/python3.6/zipfile.py", line 1337, in read
with self.open(name, "r", pwd) as fp:
File "/usr/lib/python3.6/zipfile.py", line 1396, in open
raise BadZipFile("Bad magic number for file header")
zipfile.BadZipFile: Bad magic number for file header
EOFError
File "/home/user/app/application/document_assembly/core_da.py", line 524, in translate_ampersands
doc = Document(docx=Path(doc_path))
File "/home/user/app-venv/lib/python3.6/site-packages/docx/api.py", line 25, in Document
document_part = Package.open(docx).main_document_part
File "/home/user/app-venv/lib/python3.6/site-packages/docx/opc/package.py", line 116, in open
pkg_reader = PackageReader.from_file(pkg_file)
File "/home/user/app-venv/lib/python3.6/site-packages/docx/opc/pkgreader.py", line 36, in from_file
phys_reader, pkg_srels, content_types
File "/home/user/app-venv/lib/python3.6/site-packages/docx/opc/pkgreader.py", line 69, in _load_serialized_parts
for partname, blob, reltype, srels in part_walker:
File "/home/user/app-venv/lib/python3.6/site-packages/docx/opc/pkgreader.py", line 110, in _walk_phys_parts
for partname, blob, reltype, srels in next_walker:
File "/home/user/app-venv/lib/python3.6/site-packages/docx/opc/pkgreader.py", line 105, in _walk_phys_parts
blob = phys_reader.blob_for(partname)
File "/home/user/app-venv/lib/python3.6/site-packages/docx/opc/phys_pkg.py", line 108, in blob_for
return self._zipf.read(pack_uri.membername)
File "/usr/lib/python3.6/zipfile.py", line 1338, in read
return fp.read()
File "/usr/lib/python3.6/zipfile.py", line 858, in read
buf += self._read1(self.MAX_N)
File "/usr/lib/python3.6/zipfile.py", line 940, in _read1
data += self._read2(n - len(data))
File "/usr/lib/python3.6/zipfile.py", line 975, in _read2
raise EOFError
EOFError
Both of these errors indicate that the specified file is not a valid zip archive. So I expect something is going wrong with the writing of the file (by the step prior to find-and-replace).
I would start by stopping the process after writing the file and seeing if the file is present on the filesystem and whether it can be opened manually using Word. This should bisect the problem and narrow it down to a writing problem or a reading problem.
It's possible that an error is raised on the write and not caught, leaving an empty or un-flushed (still open) file. So having a way to monitor that step is probably a good idea; writing to a log comes to mind as one way to manage that.
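As one way to add that monitoring, here is a rough sketch (the function name is mine, not from your code) that validates the generated file and logs a useful message before python-docx ever touches it:

import logging
import zipfile
from pathlib import Path
from docx import Document

def open_document_checked(doc_path):
    # Log (rather than crash) when the generated .docx is missing, empty,
    # or not a complete zip archive, which is what both tracebacks suggest.
    path = Path(doc_path)
    if not path.exists() or path.stat().st_size == 0:
        logging.error("Generated document missing or empty: %s", path)
        return None
    if not zipfile.is_zipfile(str(path)):
        logging.error("Generated document is not a valid zip archive: %s", path)
        return None
    return Document(str(path))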
Inspecting the particular cases where there is a failure and managing to reproduce it are going to be critically important. If that's not possible, it's going to be a tough road of guesswork and disappointment on both sides.
It turns out some code was added recently, just before this started happening, which effectively sent a duplicate request to the server to generate the document in question. These requests seem to run in parallel, which is surprising because I would have predicted the conflict to happen much more frequently (same template file being used, generated documents written to the same directory).
It seems that when the requests were sequenced with a particular timing, the find-and-replace operation of one request would run into the save operation of the other. In other words, I think one request was trying to open a document that was still in the process of being saved.
So I'm glad it's not something more obscure with the python-docx library, which would have been a lot harder to nail down.
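If we ever need to serialize those two operations instead, a per-document lock would be one way to do it; a rough sketch using the filelock package (purely illustrative, not part of our code):

from filelock import FileLock
from docx import Document

def translate_ampersands_locked(doc_path):
    # Hold a per-document lock so a save in one request and a re-open in
    # another request cannot overlap on the same file.
    with FileLock(str(doc_path) + ".lock"):
        doc = Document(str(doc_path))
        # ... search and replace ...
        doc.save(str(doc_path))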
My main Python thread does not use asyncio, but it creates a new thread whose code does. That thread hits an error when calling get_event_loop():
Traceback (most recent call last):
File "/usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py", line 926, in _bootstrap_inner
self.run()
File "/usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/lib/python3.7/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/Users/hanxu/work/thunderrock/node_server/quic/udp_async.py", line 33, in udp_async
loop = asyncio.get_event_loop()
File "/usr/local/Cellar/python/3.7.5/Frameworks/Python.framework/Versions/3.7/lib/python3.7/asyncio/events.py", line 644, in get_event_loop
% threading.current_thread().name)
RuntimeError: There is no current event loop in thread 'Thread-5'.
According to the Python 3.7 documentation, get_event_loop should automatically create a new event loop if none exists yet. Why does it fail in this case? Is it because of threading?
By the way, I am using the following workaround, but I'm still wondering why the issue exists:
try:
    loop = asyncio.get_event_loop()
except RuntimeError:
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
Thanks for the question.
The documentation is slightly incorrect: a new event loop is implicitly created only for the main thread.
Please track https://bugs.python.org/issue39381 for the fix.
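In other words, in a non-main thread you have to create the loop yourself. A minimal sketch (the coroutine is just a placeholder):

import asyncio
import threading

async def work():
    await asyncio.sleep(0.1)

def thread_target():
    # No loop exists in this thread yet, so create and install one explicitly.
    loop = asyncio.new_event_loop()
    asyncio.set_event_loop(loop)
    try:
        loop.run_until_complete(work())
    finally:
        loop.close()

threading.Thread(target=thread_target).start()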
I have run into some strange behaviour with multiprocessing.
When I try to use a global variable in a function that is called via multiprocessing, the function does not see the global variable.
Example:
import multiprocessing

def func(useless_variable):
    print(variable)

useless_list = [1,2,3,4,5,6]
p = multiprocessing.Pool(processes=multiprocessing.cpu_count())
variable = "asd"
func(useless_list)
for x in p.imap_unordered(func, useless_list):
    pass
Output:
asd
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/lib/python3.4/multiprocessing/pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "pywork/asd.py", line 4, in func
print(variable)
NameError: name 'variable' is not defined
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "pywork/asd.py", line 11, in <module>
for x in p.imap_unordered(func, useless_list):
File "/usr/lib/python3.4/multiprocessing/pool.py", line 689, in next
raise value
NameError: name 'variable' is not defined
As you can see, the first time I simply call func it prints asd as expected. However, when I call the very same function through multiprocessing, it says that variable is not defined, even though I clearly printed it just before.
Does multiprocessing ignore global variables? How can I work around this?
A multiprocessing Pool forks its worker processes (or spawns them, in a way intended to mimic forking, on Windows) at the moment the Pool is created. Forking maps the parent's memory into the children as copy-on-write, but it doesn't create a persistent tie between them; after the fork, changes made in the parent are not visible in the children, and vice versa. You can't use any variables defined after the Pool was created, and changes made to variables defined before the Pool was created will not be reflected in the workers.
Typically, with a Pool, you want to avoid mutable global state entirely: pass all the data the function needs to the function you're imap-ing (or whatever) as arguments, which are serialized and sent to the children so the state is correct, and have the function return any new data instead of mutating globals; the return value is serialized and sent back to the parent process to use as it sees fit.
Managers are an option, but usually not the correct one with Pools; you usually want workers to look only at read-only globals defined before the Pool was created, or to work with arguments and return values, not use global state at all.
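A minimal sketch of that pattern, reworking the example from the question so the worker receives everything it needs as an argument:

import multiprocessing

def func(args):
    # Everything the worker needs arrives as an argument; no globals required.
    item, variable = args
    return '{}: {}'.format(variable, item)

if __name__ == '__main__':
    useless_list = [1, 2, 3, 4, 5, 6]
    variable = "asd"
    with multiprocessing.Pool(processes=multiprocessing.cpu_count()) as p:
        for result in p.imap_unordered(func, [(x, variable) for x in useless_list]):
            print(result)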
When you spawn a process, all context is copied; to exchange objects between processes you need to use managers. Check the official documentation, in particular the section on sharing state between processes.
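For completeness, a rough sketch of that manager-based approach (the shared dict and initializer are my own illustration, not from the question):

import multiprocessing

def init_worker(shared_proxy):
    # Hand the manager proxy to each worker once, at Pool start-up.
    global shared
    shared = shared_proxy

def func(item):
    # The proxy is shared between processes, unlike an ordinary global.
    return shared['prefix'] + str(item)

if __name__ == '__main__':
    manager = multiprocessing.Manager()
    shared = manager.dict({'prefix': 'asd-'})
    with multiprocessing.Pool(initializer=init_worker, initargs=(shared,)) as p:
        print(list(p.imap_unordered(func, [1, 2, 3])))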