Error using MultiWorkerMirroredStrategy to train object detection research model ssd_mobilenet_v1_fpn_640x640_coco17_tpu-8 - python-3.x

I'm trying to train the research model ssd_mobilenet_v1_fpn_640x640_coco17_tpu-8 using MultiWorkerMirroredStrategy (by setting --num_workers=2 in the invocation of model_main_tf2.py), across two workers (0 and 1), each with a single GPU. However, when I attempt this I get the following error, always on worker 1:
Traceback (most recent call last):
File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\tensorflow\python\distribute\input_lib.py", line 553, in __next__
return self.get_next()
File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\tensorflow\python\distribute\input_lib.py", line 610, in get_next
return self._get_next_no_partial_batch_handling(name)
File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\tensorflow\python\distribute\input_lib.py", line 642, in _get_next_no_partial_batch_handling
replicas.extend(self._iterators[i].get_next_as_list(new_name))
File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\tensorflow\python\distribute\input_lib.py", line 1594, in get_next_as_list
return self._format_data_list_with_options(self._iterator.get_next())
File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\tensorflow\python\data\ops\multi_device_iterator_ops.py", line 580, in get_next
result.append(self._device_iterators[i].get_next())
File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\tensorflow\python\data\ops\iterator_ops.py", line 889, in get_next
return self._next_internal()
File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\tensorflow\python\data\ops\iterator_ops.py", line 819, in _next_internal
ret = gen_dataset_ops.iterator_get_next(
File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\tensorflow\python\ops\gen_dataset_ops.py", line 2922, in iterator_get_next
_ops.raise_from_not_ok_status(e, name)
File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\tensorflow\python\framework\ops.py", line 7186, in raise_from_not_ok_status
raise core._status_to_exception(e) from None # pylint: disable=protected-access
tensorflow.python.framework.errors_impl.OutOfRangeError: End of sequence [Op:IteratorGetNext]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\JS\Desktop\Tensorflow\models\research\object_detection\model_main_tf2.py", line 114, in <module>
tf.compat.v1.app.run()
File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\tensorflow\python\platform\app.py", line 36, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\absl\app.py", line 312, in run
_run_main(main, args)
File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\absl\app.py", line 258, in _run_main
sys.exit(main(argv))
File "C:\Users\JS\Desktop\Tensorflow\models\research\object_detection\model_main_tf2.py", line 105, in main
model_lib_v2.train_loop(
File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\object_detection\model_lib_v2.py", line 605, in train_loop
load_fine_tune_checkpoint(
File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\object_detection\model_lib_v2.py", line 401, in load_fine_tune_checkpoint
_ensure_model_is_built(model, input_dataset, unpad_groundtruth_tensors)
File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\object_detection\model_lib_v2.py", line 161, in _ensure_model_is_built
features, labels = iter(input_dataset).next()
File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\tensorflow\python\distribute\input_lib.py", line 549, in next
return self.__next__()
File "C:\Users\JS\.conda\envs\tensor2\lib\site-packages\tensorflow\python\distribute\input_lib.py", line 555, in __next__
raise StopIteration
StopIteration
Worker 0 eventually fails after detecting that worker 1 has gone down.
This error happens regardless of the physical machines on which the two workers run. In other words, I see it whether I run both workers on a single machine (using localhost) or on different machines on the same network.
Based on the trace in the error messages, the error appears to be occurring whenever the training loop attempts to iterate over the training data generated by strategy.experimental_distribute_datasets_from_function. Note that if I change the strategy to MirroredStrategy it runs fine on a single machine (no other changes made). I'm not sure if I'm doing something wrong or if there is a bug in the object detection API.
My setup on both machines is identical (I basically followed the setup instructions on the object detection website):
Windows 10
Tensorflow 2.8.0
Cuda Toolkit 11.2
cudnn 8.1
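(For context, MultiWorkerMirroredStrategy picks up the worker cluster from the TF_CONFIG environment variable; a minimal two-worker layout looks roughly like the sketch below, where the hosts and ports are placeholders and the second worker sets "index": 1.)

import json
import os

# Placeholder hosts/ports for a two-worker cluster; worker 1 uses "index": 1.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["192.168.1.10:2222", "192.168.1.11:2222"]},
    "task": {"type": "worker", "index": 0},
})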
Has anyone ever seen this error before? If so, is there a way around it?

Ok, I think I understand the issue. In the object detection library there is a file called dataset_builder.py that builds the training dataset from the TFRecord stored in the file specified in the pipeline.config file (in the input_path item of the tf_record_input_reader). The function that actually reads the TFRecord file is _read_dataset_internal. This function treats the input_path of the pipeline config as a LIST OF FILES and then applies a sharding function (passed as an argument) to divide the files between the replicas doing the training (one replica per worker). Since my input_path only specified a single TFRecord file it was assigned to the first replica and the other replicas were given empty filenames!! Thus only the first replica actually had an input dataset to work with, hence the crash.
The solution was to split the training data across two files (two TFRecords) and then set the input_path in pipeline.config to be a list of paths rather than a single path. Once I did this it appears as though the model trained successfully (at least it didn't crash).
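A minimal sketch of the split, assuming TensorFlow is available and using placeholder filenames:

import tensorflow as tf

source = "train.record"  # placeholder: the single TFRecord from the original setup
shards = ["train-00000-of-00002.record", "train-00001-of-00002.record"]

# Round-robin the serialized examples into two shard files.
writers = [tf.io.TFRecordWriter(path) for path in shards]
for i, example in enumerate(tf.data.TFRecordDataset(source)):
    writers[i % len(writers)].write(example.numpy())
for writer in writers:
    writer.close()

Both shard files are then listed as repeated input_path entries under tf_record_input_reader in pipeline.config (one input_path line per file).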
I'm not sure if this is a bug in the object detection code or not. I assumed that if I only had one training record (visible to both workers), both workers would use it and just batch the data accordingly. I'm just not sure if the assumption itself is wrong, or if the assumption is correct and the code is wrong.
Anyway, I hope this helps anyone who might be wrestling with the same issue.

Related

RuntimeError: unexpected EOF, expected 3302200 more bytes. The file might be corrupted

I am trying to use the pretrained model from the following repository. I need your assistance to rectify this error:
RuntimeError: unexpected EOF, expected 3302200 more bytes. The file might be corrupted.
I tried to run the pretrained CANNet model from the following repo on Google Colab, and I followed all the steps (Prerequisites, cloning, Data Preparation, and Testing):
https://github.com/gjy3035/NWPU-Crowd-Sample-Code.git
The detailed error is given below
Traceback (most recent call last):
File "test.py", line 118, in
main()
File "test.py", line 46, in main
test(lines, model_path)
File "test.py", line 55, in test
net.load_state_dict(torch.load(model_path))
File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 593, in load
return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
File "/usr/local/lib/python3.7/dist-packages/torch/serialization.py", line 779, in _legacy_load
deserialized_objects[key]._set_from_file(f, offset, f_should_read_directly)
RuntimeError: unexpected EOF, expected 3302200 more bytes. The file might be corrupted.
Check out this github link: https://github.com/huggingface/transformers/issues/1491
It proposes that one should use the force_download arg. This is equivalent to force_reload, assuming you're using torch.hub.load to load the pretrained model. The other proposed option, applicable to Windows users, is to delete the downloaded model and download it again.
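For example, if the weights come from torch.hub, forcing a fresh download looks roughly like this (the repo and model names below are just placeholders, not the CANNet checkpoint):

import torch

# force_reload=True discards the cached download and fetches the files again.
model = torch.hub.load('pytorch/vision', 'resnet18', pretrained=True, force_reload=True)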
I have the same issue, but setting force_reload=True hasn't cleared it for me; I'm thinking I have space problems, but I think it's worth a shot on your end.
I also faced the same problem while evaluating my trained model on Google Colab. I found that the model was taking a long time to fully upload to the machine, and I was testing with the incompletely uploaded model. When I made sure the model had been fully uploaded before running, it worked.
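Before calling load_state_dict, a quick sanity check along those lines is to compare the local file's size or MD5 against the source copy (the path below is a placeholder):

import hashlib
import os

model_path = "cannet_checkpoint.pth"  # placeholder path to the downloaded weights

# Compare these values against the size/hash of the original file before loading it.
print("size (bytes):", os.path.getsize(model_path))
with open(model_path, "rb") as f:
    print("md5:", hashlib.md5(f.read()).hexdigest())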

python-docx: Error opening file - "Bad magic number for file header" / "EOFError"

The company I work for distributes document assembly software that uses the python-docx library. The software runs a function on every generated document that opens the document and does a simple search and replace for characters that weren't escaped properly (namely "&amp;" -> "&").
FYI, the actual document assembly uses python-docx-template. However, the error happens after the document has already been assembled; it is triggered by the search-and-replace function, which only uses python-docx.
Recently, we've had a few cases where documents are failing to generate on client deployments. They're throwing an error on this line where the document object is instantiated:
doc = Document(docx=Path(doc_path))
We've seen two errors:
raise BadZipFile("Bad magic number for file header")
and
raise EOFError
The software is widely used and we've never had this issue before. We can't reproduce it in our test environments. The error has only started appearing in the past week but has shown up for several clients after they were updated. The software will fail to generate a particular document some number of times but will succeed after a few tries.
We've only seen it happen with one document in particular, but all documents use the same search and replace function, and like I said the error is only intermittent with the problem document.
There have been no changes in code to this search and replace function and I can't think of any other meaningful difference to our doc assembly process that would explain this.
I'm having a lot of trouble finding info on what could cause this specifically with the python-docx library. Is this a sign that the generated document is corrupted? If anyone is able to shed some light on possible causes that would be very helpful!
Here's the stack trace for both errors:
Bad magic number...
File "/home/user/app/application/document_assembly/core_da.py", line 524, in translate_ampersands
doc = Document(docx=Path(doc_path))
File "/home/user/app-venv/lib/python3.6/site-packages/docx/api.py", line 25, in Document
document_part = Package.open(docx).main_document_part
File "/home/user/app-venv/lib/python3.6/site-packages/docx/opc/package.py", line 116, in open
pkg_reader = PackageReader.from_file(pkg_file)
File "/home/user/app-venv/lib/python3.6/site-packages/docx/opc/pkgreader.py", line 36, in from_file
phys_reader, pkg_srels, content_types
File "/home/user/app-venv/lib/python3.6/site-packages/docx/opc/pkgreader.py", line 69, in _load_serialized_parts
for partname, blob, reltype, srels in part_walker:
File "/home/user/app-venv/lib/python3.6/site-packages/docx/opc/pkgreader.py", line 104, in _walk_phys_parts
part_srels = PackageReader._srels_for(phys_reader, partname)
File "/home/user/app-venv/lib/python3.6/site-packages/docx/opc/pkgreader.py", line 83, in _srels_for
rels_xml = phys_reader.rels_xml_for(source_uri)
File "/home/user/app-venv/lib/python3.6/site-packages/docx/opc/phys_pkg.py", line 129, in rels_xml_for
rels_xml = self.blob_for(source_uri.rels_uri)
File "/home/user/app-venv/lib/python3.6/site-packages/docx/opc/phys_pkg.py", line 108, in blob_for
return self._zipf.read(pack_uri.membername)
File "/usr/lib/python3.6/zipfile.py", line 1337, in read
with self.open(name, "r", pwd) as fp:
File "/usr/lib/python3.6/zipfile.py", line 1396, in open
raise BadZipFile("Bad magic number for file header")
zipfile.BadZipFile: Bad magic number for file header
EOFError
File "/home/user/app/application/document_assembly/core_da.py", line 524, in translate_ampersands
doc = Document(docx=Path(doc_path))
File "/home/user/app-venv/lib/python3.6/site-packages/docx/api.py", line 25, in Document
document_part = Package.open(docx).main_document_part
File "/home/user/app-venv/lib/python3.6/site-packages/docx/opc/package.py", line 116, in open
pkg_reader = PackageReader.from_file(pkg_file)
File "/home/user/app-venv/lib/python3.6/site-packages/docx/opc/pkgreader.py", line 36, in from_file
phys_reader, pkg_srels, content_types
File "/home/user/app-venv/lib/python3.6/site-packages/docx/opc/pkgreader.py", line 69, in _load_serialized_parts
for partname, blob, reltype, srels in part_walker:
File "/home/user/app-venv/lib/python3.6/site-packages/docx/opc/pkgreader.py", line 110, in _walk_phys_parts
for partname, blob, reltype, srels in next_walker:
File "/home/user/app-venv/lib/python3.6/site-packages/docx/opc/pkgreader.py", line 105, in _walk_phys_parts
blob = phys_reader.blob_for(partname)
File "/home/user/app-venv/lib/python3.6/site-packages/docx/opc/phys_pkg.py", line 108, in blob_for
return self._zipf.read(pack_uri.membername)
File "/usr/lib/python3.6/zipfile.py", line 1338, in read
return fp.read()
File "/usr/lib/python3.6/zipfile.py", line 858, in read
buf += self._read1(self.MAX_N)
File "/usr/lib/python3.6/zipfile.py", line 940, in _read1
data += self._read2(n - len(data))
File "/usr/lib/python3.6/zipfile.py", line 975, in _read2
raise EOFError
EOFError
Both of these errors indicate that the specified file is not a valid zip archive. So I expect something is going wrong with the writing of the file (by the step prior to find-and-replace).
I would start by stopping the process after writing the file and seeing if the file is present on the filesystem and whether it can be opened manually using Word. This should bisect the problem and narrow it down to a writing problem or a reading problem.
It could be possible that an error is raised on the write and it's not being caught or whatever, leaving an empty or un-flushed (open) file. So having a way to monitor that step is probably a good idea. Writing to a log comes to mind as how you might manage that.
Inspecting the particular cases where there is a failure and managing to reproduce it are going to be critically important. If that's not possible, it's going to be a tough road of guesswork and disappointment on both sides.
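One cheap check in that spirit: right after the assembly step writes the file, confirm it is a readable zip archive and log the result (the logger and function names below are hypothetical):

import logging
import zipfile

log = logging.getLogger("doc_assembly")

def verify_docx(doc_path):
    # A .docx is a zip archive; a truncated or still-open file fails these checks.
    if not zipfile.is_zipfile(doc_path):
        log.error("Generated file is not a valid zip archive: %s", doc_path)
        return False
    with zipfile.ZipFile(doc_path) as zf:
        bad_member = zf.testzip()  # returns the first corrupt member name, or None
        if bad_member is not None:
            log.error("Corrupt member %s in %s", bad_member, doc_path)
            return False
    return True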
It turns out some code was added recently, just before this started happening, which effectively sent a duplicate request to the server to generate the document in question. These requests seem to run in parallel, which is surprising because I would have expected the conflict to happen much more frequently (the same template file is used, and the generated documents are written to the same directory).
It seems that, given a particular timing of the requests, the "find-and-replace" operation of one request would run into the "save" operation of the other. In other words, I think one request was trying to open a document that was still in the process of being saved.
So I'm glad it's not something more obscure with the python-docx library, which would have been a lot harder to nail down.
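A common mitigation for this kind of race is to save the document to a temporary file and atomically rename it into place, so a concurrent reader never sees a half-written file. A minimal sketch (the helper name is hypothetical):

import os
import tempfile

def save_atomically(document, final_path):
    # Write to a temp file in the target directory, then swap it into place.
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(final_path), suffix=".docx")
    os.close(fd)
    try:
        document.save(tmp_path)           # python-docx Document.save()
        os.replace(tmp_path, final_path)  # atomic within the same filesystem
    except Exception:
        os.remove(tmp_path)
        raise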

How to track down errors in Flopy created MODFLOW output

For a few weeks I've been writing MODFLOW models with FloPy in Python. I chose FloPy because of the transparency of Python. However, once in a while my model doesn't run, and it doesn't tell me where it goes wrong, which makes error handling difficult.
At the moment my model gives an error when running the model. It raises the exception with the message I added manually (a commonly used pattern):
success, mfoutput = mf.run_model(silent=False, pause=False)
if not success:
    raise Exception('MODFLOW did not terminate normally.')
The error I got is:
FloPy is using the following executable to run the model: /usr/bin/mf2005
MODFLOW-2005
U.S. GEOLOGICAL SURVEY MODULAR FINITE-DIFFERENCE GROUND-WATER FLOW MODEL
Version 1.12.00 2/3/2017
Using NAME file: spangen_mod.nam
Run start date and time (yyyy/mm/dd hh:mm:ss): 2019/04/23 16:12:39
Traceback (most recent call last):
File "<ipython-input-85-c7ecca798eed>", line 1, in <module>
runfile('/Users/user/Desktop/modflow/model.py', wdir='/Users/user/Desktop/modflow')
File "/Users/user/anaconda3/lib/python3.7/site-packages/spyder_kernels/customize/spydercustomize.py", line 704, in runfile
execfile(filename, namespace)
File "/Users/user/anaconda3/lib/python3.7/site-packages/spyder_kernels/customize/spydercustomize.py", line 108, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "/Users/user/Desktop/modflow/model.py", line 223, in <module>
raise Exception('MODFLOW did not terminate normally.')
Exception: MODFLOW did not terminate normally.
Besides, all the MODFLOW files are created, but the .hds and .cbc files contain zero bytes.
My question: does anyone have tips for tracking down these kinds of errors efficiently?
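When run_model() fails like this, the MODFLOW listing file usually contains the actual reason, and FloPy can also hand back the captured console output. A rough sketch building on the snippet above (the report=True flag and the .list filename are assumptions based on FloPy's defaults):

success, mfoutput = mf.run_model(silent=False, pause=False, report=True)
if not success:
    # With report=True, FloPy returns the captured console output in mfoutput.
    for line in mfoutput:
        print(line)
    # The listing file normally reports the detailed MODFLOW error
    # (e.g. a failed array read or a missing package entry).
    with open('spangen_mod.list') as f:
        print(''.join(f.readlines()[-40:]))
    raise Exception('MODFLOW did not terminate normally.')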

HDF5 inflate error on windows

I've been using a dataset in an HDF5 file to feed a convolutional neural network during training.
My code is written in Python using the Keras library. The function that provides the data to the CNN is a generator. When the computer is not being used, the processing works fine without any error. However, when I start using the computer, the code aborts with an inflate error:
...
File "E:\MATLAB\2018, Autoencoder\utils.py", line 311, in listImgNameToBatch
im_tmp = self.hf_AD.get('/AD/' + im_name)[im_idx]
File "h5py\_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py\_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "C:\Users\Helder\Miniconda3\lib\site-packages\h5py\_hl\dataset.py", line 496, in __getitem__
self.id.read(mspace, fspace, arr, mtype, dxpl=self._dxpl)
File "h5py\_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py\_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "h5py\h5d.pyx", line 181, in h5py.h5d.DatasetID.read
File "h5py\_proxy.pyx", line 130, in h5py._proxy.dset_rw
File "h5py\_proxy.pyx", line 84, in h5py._proxy.H5PY_H5Dread
OSError: Can't read data (inflate() failed)
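For reference, the call at utils.py line 311 boils down to the following pattern (file and dataset names simplified; the path is a placeholder):

import h5py

hf_AD = h5py.File("dataset_AD.h5", "r")  # placeholder filename

def load_sample(im_name, im_idx):
    # Reads one compressed chunk from the dataset; the inflate error
    # is raised inside this slicing call.
    return hf_AD.get('/AD/' + im_name)[im_idx]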
The dataset is not corrupted, since I checked its MD5 hash. It was created using compression level 5. The error does not happen at the same point: sometimes the training gets through 20 epochs before aborting, sometimes 5, depending on whether the computer is being used. On top of that, it does not raise the error immediately; it takes a while. To me it looks like a random event that causes the reading to crash.
Is anybody experiencing the same behavior? Is it a problem with the HDF5 library or with Windows? Unfortunately, I don't have a Linux machine to test on, and using a VM is not an option since I'm processing on the GPU.
Any information will be welcome.

h5py, sporadic writing errors

I have some float numbers to be stored in a big (500K x 500K) matrix. I am storing them in chunks, using arrays of variable size (according to some specific conditions).
I have parallelised code (Python 3.3 and h5py) which produces the arrays and puts them in a shared queue, and one dedicated process that pops them from the queue and writes them one by one into the HDF5 matrix. It works as expected approximately 90% of the time.
Occasionally, I get writing errors for specific arrays. If I run it multiple times, the faulty arrays change every time.
Here's the code:
def writer(in_q):
    # Open HDF5 archive
    hdf5_file = h5py.File("./google_matrix_test.hdf5")
    hdf5_scores = hdf5_file['scores']
    while True:
        # Get some data
        try:
            data = in_q.get(timeout=5)
        except:
            hdf5_file.flush()
            print('HDF5 archive updated.')
            break
        # Process the data
        try:
            hdf5_scores[data[0], data[1]:data[2]+1] = numpy.matrix(data[3:])
        except:
            # Print faulty chunk's info
            print('E: ' + str(data[0:3]))
            in_q.put(data)  # <- doesn't solve
        in_q.task_done()

def compute():
    jobs_queue = JoinableQueue()
    scores_queue = JoinableQueue()
    processes = []
    processes.append(Process(target=producer, args=(jobs_queue, data,)))
    processes.append(Process(target=writer, args=(scores_queue,)))
    for i in range(10):
        processes.append(Process(target=consumer, args=(jobs_queue, scores_queue,)))
    for p in processes:
        p.start()
    processes[1].join()
    scores_queue.join()
Here's the error:
Process Process-2:
Traceback (most recent call last):
File "/local/software/python3.3/lib/python3.3/multiprocessing/process.py", line 258, in _bootstrap
self.run()
File "/local/software/python3.3/lib/python3.3/multiprocessing/process.py", line 95, in run
self._target(*self._args, **self._kwargs)
File "./compute_scores_multiprocess.py", line 104, in writer
hdf5_scores[data[0], data[1]:data[2]+1] = numpy.matrix(data[3:])
File "/local/software/python3.3/lib/python3.3/site-packages/h5py/_hl/dataset.py", line 551, in __setitem__
self.id.write(mspace, fspace, val, mtype)
File "h5d.pyx", line 217, in h5py.h5d.DatasetID.write (h5py/h5d.c:2925)
File "_proxy.pyx", line 120, in h5py._proxy.dset_rw (h5py/_proxy.c:1491)
File "_proxy.pyx", line 93, in h5py._proxy.H5PY_H5Dwrite (h5py/_proxy.c:1301)
OSError: can't write data (Dataset: Write failed)
If I insert a pause of two seconds (time.sleep(2)) between writing tasks, the problem seems solved (although I cannot waste 2 seconds per write since I need to write more than 250,000 times). If I capture the writing exception and put the faulty array back in the queue, the script will never stop (presumably).
I am using CentOS (2.6.32-279.11.1.el6.x86_64). Any insight?
Thanks a lot.
When using the multiprocessing module with HDF5, the only big restriction is that you can't have any files open (even read-only) when fork() is called. In other words, if you open a file in the master process to write, and then Python spins off a subprocess for computation, there may be problems. It has to do with how fork() works and the choices HDF5 itself makes about how to handle file descriptors.
My advice is to double-check your application to make sure you're creating any Pools, etc. before opening the master file for writing.
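A minimal sketch of that ordering, reusing names from the question (the sentinel logic is an assumption): every child process is started before the parent touches the HDF5 file, and the writer opens the file itself, after the fork:

from multiprocessing import JoinableQueue, Process

import h5py
import numpy

def writer(in_q, path):
    # Opened here, inside the child process, so no HDF5 handle crosses fork().
    with h5py.File(path, "a") as hdf5_file:
        scores = hdf5_file["scores"]
        while True:
            data = in_q.get()
            if data is None:  # sentinel: producers are done
                in_q.task_done()
                break
            scores[data[0], data[1]:data[2] + 1] = numpy.asarray(data[3:])
            in_q.task_done()

def compute():
    scores_queue = JoinableQueue()
    w = Process(target=writer, args=(scores_queue, "./google_matrix_test.hdf5"))
    w.start()  # fork happens before any h5py.File is opened in the parent
    # ... start producer/consumer processes and feed scores_queue here ...
    scores_queue.put(None)
    scores_queue.join()
    w.join()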
