ResourceExhaustedError when using GPU but not CPU on Azure


My code builds and runs the graph successfully in CPU mode on Azure ML, but in GPU mode it raises a ResourceExhaustedError during the graph building phase.
I switch between CPU and GPU modes by simply removing the device placement:
with tf.device('/cpu:0'), tf.name_scope('embedding'): # CPU mode runs fine
with tf.name_scope('embedding'): # GPU mode throws the exception
I tried loading less data, but that didn't help either.
I suspect I missed some steps when setting up the GPU. Any ideas?
Azure error msg:
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[78298,300]
[[Node: embedding_matrix/Assign = Assign[T=DT_FLOAT, _class=["loc:#embedding_matrix"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](embedding_matrix, embedding_matrix/Initializer/Const)]]
Complete error msg:
Traceback (most recent call last):
File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1323, in _do_call
return fn(*args)
File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1302, in _run_fn
status, run_metadata)
File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[78298,300]
[[Node: embedding_matrix/Assign = Assign[T=DT_FLOAT, _class=["loc:#embedding_matrix"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](embedding_matrix, embedding_matrix/Initializer/Const)]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "NN.py", line 130, in <module>
sess.run(init)
File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 889, in run
run_metadata_ptr)
File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1120, in _run
feed_dict_tensor, options, run_metadata)
File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1317, in _do_run
options, run_metadata)
File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1336, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[78298,300]
[[Node: embedding_matrix/Assign = Assign[T=DT_FLOAT, _class=["loc:#embedding_matrix"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](embedding_matrix, embedding_matrix/Initializer/Const)]]
Caused by op 'embedding_matrix/Assign', defined at:
File "NN.py", line 120, in <module>
, trainable=False)
File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/ops/variable_scope.py", line 1203, in get_variable
constraint=constraint)
File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/ops/variable_scope.py", line 1092, in get_variable
constraint=constraint)
File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/ops/variable_scope.py", line 425, in get_variable
constraint=constraint)
File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/ops/variable_scope.py", line 394, in _true_getter
use_resource=use_resource, constraint=constraint)
File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/ops/variable_scope.py", line 805, in _get_single_variable
constraint=constraint)
File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/ops/variables.py", line 213, in __init__
constraint=constraint)
File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/ops/variables.py", line 346, in _init_from_args
validate_shape=validate_shape).op
File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/ops/state_ops.py", line 276, in assign
validate_shape=validate_shape)
File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/ops/gen_state_ops.py", line 57, in assign
use_locking=use_locking, name=name)
File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
op_def=op_def)
File "/anaconda/envs/py35/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1470, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[78298,300]
[[Node: embedding_matrix/Assign = Assign[T=DT_FLOAT, _class=["loc:#embedding_matrix"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/device:GPU:0"](embedding_matrix, embedding_matrix/Initializer/Const)]]

Host memory is quite a bit larger than device memory on an N-series machine.
Are you sure you aren't simply exceeding the device capacity?
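
To put a rough number on that, here is a minimal sketch that estimates the raw size of the tensor named in the error, shape[78298,300] with T=DT_FLOAT (4 bytes per element). The helper name tensor_bytes is made up for illustration and is not a TensorFlow API. It only counts the tensor itself, not the initializer constant or anything else already on the device.

```python
def tensor_bytes(shape, bytes_per_element=4):
    """Return the raw storage size of a dense tensor in bytes."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element


# Shape and dtype are taken from the error message; DT_FLOAT is 4 bytes per element.
embedding_bytes = tensor_bytes((78298, 300))
print(embedding_bytes)                     # 93957600
print(round(embedding_bytes / 2**20, 1))   # 89.6 (MiB)
```

That is roughly 90 MiB for the one tensor, so if the allocation still fails, it may be worth checking with nvidia-smi how much device memory is actually free, for example whether another process is already holding most of it.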

Related

Training four models simultaneously on a single machine with 8 GPUs

I have an 8-GPU workstation for deep learning with TensorFlow.
The workstation specifications are as follows:
Intel Xeon Gold x2
NVIDIA Quadro RTXA6000 x8
RAM 1TB
I use:
Python 3.8
Ubuntu 20.04
TensorFlow 2.8
CUDA 11.2
cuDNN 8.1
I want to run four Jupyter notebooks simultaneously to train four models, because I think using all 8 GPUs to train a single model is not very efficient. However, the notebook died during training and I got several warning messages.
Do you have any ideas on how to fix these errors, and are they related to running multiple notebooks?
[E 13:32:28.388 NotebookApp] Uncaught exception in ZMQStream callback
Traceback (most recent call last):
File "/home/super/researchvenv/lib/python3.8/site-packages/zmq/eventloop/zmqstream.py", line 584, in _run_callback
f = callback(*args, **kwargs)
File "/home/super/researchvenv/lib/python3.8/site-packages/zmq/eventloop/zmqstream.py", line 308, in stream_callback
return callback(self, msg)
File "/home/super/researchvenv/lib/python3.8/site-packages/notebook/services/kernels/handlers.py", line 572, in _on_zmq_reply
super()._on_zmq_reply(stream, msg)
File "/home/super/researchvenv/lib/python3.8/site-packages/notebook/base/zmqhandlers.py", line 256, in _on_zmq_reply
self.write_message(msg, binary=isinstance(msg, bytes))
File "/home/super/researchvenv/lib/python3.8/site-packages/tornado/websocket.py", line 339, in write_message
return self.ws_connection.write_message(message, binary=binary)
File "/home/super/researchvenv/lib/python3.8/site-packages/tornado/websocket.py", line 1086, in write_message
fut = self._write_frame(True, opcode, message, flags=flags)
File "/home/super/researchvenv/lib/python3.8/site-packages/tornado/websocket.py", line 1061, in _write_frame
return self.stream.write(frame)
File "/home/super/researchvenv/lib/python3.8/site-packages/tornado/iostream.py", line 546, in write
self._handle_write()
File "/home/super/researchvenv/lib/python3.8/site-packages/tornado/iostream.py", line 976, in _handle_write
self._write_buffer.advance(num_bytes)
File "/home/super/researchvenv/lib/python3.8/site-packages/tornado/iostream.py", line 182, in advance
assert 0 < size <= self._size
AssertionError
[E 13:32:28.388 NotebookApp] Uncaught exception in zmqstream callback
Traceback (most recent call last):
File "/home/super/researchvenv/lib/python3.8/site-packages/zmq/eventloop/zmqstream.py", line 621, in _handle_events
self._handle_recv()
File "/home/super/researchvenv/lib/python3.8/site-packages/zmq/eventloop/zmqstream.py", line 650, in _handle_recv
self._run_callback(callback, msg)
File "/home/super/researchvenv/lib/python3.8/site-packages/zmq/eventloop/zmqstream.py", line 584, in _run_callback
f = callback(*args, **kwargs)
File "/home/super/researchvenv/lib/python3.8/site-packages/zmq/eventloop/zmqstream.py", line 308, in stream_callback
return callback(self, msg)
File "/home/super/researchvenv/lib/python3.8/site-packages/notebook/services/kernels/handlers.py", line 572, in _on_zmq_reply
super()._on_zmq_reply(stream, msg)
File "/home/super/researchvenv/lib/python3.8/site-packages/notebook/base/zmqhandlers.py", line 256, in _on_zmq_reply
self.write_message(msg, binary=isinstance(msg, bytes))
File "/home/super/researchvenv/lib/python3.8/site-packages/tornado/websocket.py", line 339, in write_message
return self.ws_connection.write_message(message, binary=binary)
File "/home/super/researchvenv/lib/python3.8/site-packages/tornado/websocket.py", line 1086, in write_message
fut = self._write_frame(True, opcode, message, flags=flags)
File "/home/super/researchvenv/lib/python3.8/site-packages/tornado/websocket.py", line 1061, in _write_frame
return self.stream.write(frame)
File "/home/super/researchvenv/lib/python3.8/site-packages/tornado/iostream.py", line 546, in write
self._handle_write()
File "/home/super/researchvenv/lib/python3.8/site-packages/tornado/iostream.py", line 976, in _handle_write
self._write_buffer.advance(num_bytes)
File "/home/super/researchvenv/lib/python3.8/site-packages/tornado/iostream.py", line 182, in advance
assert 0 < size <= self._size
AssertionError
[E 13:32:28.389 NotebookApp] Exception in callback functools.partial(<function ZMQStream._update_handler.<locals>.<lambda> at 0x7f5da3064310>)
Traceback (most recent call last):
File "/home/super/researchvenv/lib/python3.8/site-packages/tornado/ioloop.py", line 740, in _run_callback
ret = callback()
File "/home/super/researchvenv/lib/python3.8/site-packages/zmq/eventloop/zmqstream.py", line 705, in <lambda>
self.io_loop.add_callback(lambda: self._handle_events(self.socket, 0))
File "/home/super/researchvenv/lib/python3.8/site-packages/zmq/eventloop/zmqstream.py", line 621, in _handle_events
self._handle_recv()
File "/home/super/researchvenv/lib/python3.8/site-packages/zmq/eventloop/zmqstream.py", line 650, in _handle_recv
self._run_callback(callback, msg)
File "/home/super/researchvenv/lib/python3.8/site-packages/zmq/eventloop/zmqstream.py", line 584, in _run_callback
f = callback(*args, **kwargs)
File "/home/super/researchvenv/lib/python3.8/site-packages/zmq/eventloop/zmqstream.py", line 308, in stream_callback
return callback(self, msg)
File "/home/super/researchvenv/lib/python3.8/site-packages/notebook/services/kernels/handlers.py", line 572, in _on_zmq_reply
super()._on_zmq_reply(stream, msg)
File "/home/super/researchvenv/lib/python3.8/site-packages/notebook/base/zmqhandlers.py", line 256, in _on_zmq_reply
self.write_message(msg, binary=isinstance(msg, bytes))
File "/home/super/researchvenv/lib/python3.8/site-packages/tornado/websocket.py", line 339, in write_message
return self.ws_connection.write_message(message, binary=binary)
File "/home/super/researchvenv/lib/python3.8/site-packages/tornado/websocket.py", line 1086, in write_message
fut = self._write_frame(True, opcode, message, flags=flags)
File "/home/super/researchvenv/lib/python3.8/site-packages/tornado/websocket.py", line 1061, in _write_frame
return self.stream.write(frame)
File "/home/super/researchvenv/lib/python3.8/site-packages/tornado/iostream.py", line 546, in write
self._handle_write()
File "/home/super/researchvenv/lib/python3.8/site-packages/tornado/iostream.py", line 976, in _handle_write
self._write_buffer.advance(num_bytes)
File "/home/super/researchvenv/lib/python3.8/site-packages/tornado/iostream.py", line 182, in advance
assert 0 < size <= self._size
AssertionError
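
For the four-notebooks-on-8-GPUs setup described above, one common approach is to give each notebook its own subset of GPUs through the CUDA_VISIBLE_DEVICES environment variable, set before TensorFlow is imported. The sketch below is only an illustration under that assumption: the helper visible_devices_for and the even two-GPUs-per-notebook split are hypothetical, and this does not by itself address the tornado AssertionError shown in the log.

```python
import os


def visible_devices_for(job_index, total_gpus=8, jobs=4):
    """Return a CUDA_VISIBLE_DEVICES value giving each job an equal GPU slice."""
    if total_gpus % jobs != 0:
        raise ValueError("total_gpus must be divisible by jobs")
    if not 0 <= job_index < jobs:
        raise ValueError("job_index out of range")
    per_job = total_gpus // jobs
    first = job_index * per_job
    return ",".join(str(i) for i in range(first, first + per_job))


# In the second notebook (job_index=1), run this cell before importing tensorflow,
# so that the process only sees the GPUs assigned to it.
os.environ["CUDA_VISIBLE_DEVICES"] = visible_devices_for(1)
print(os.environ["CUDA_VISIBLE_DEVICES"])  # 2,3
```

Each notebook would use a different job_index (0 to 3), so the four training runs do not compete for the same devices.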
Running multiple Jupyter notebooks on a single machine to train multiple TF models simultaneously

self._traceback = tf_stack.extract_stack()

I am training a custom ssd_mobilenet_v2_quantized_300x300 TensorFlow model for object detection on Google Colab, using TensorFlow downgraded to 1.15.2. I used to train my model on TensorFlow 1.14.0, but since the latest update to version 2.2.0 I get strange errors, so I can't use the latest version.
With version 1.15.2, and even with a batch size of 8, the training process starts successfully, but after some time it stops with the following errors.
TypeError: 'numpy.float64' object cannot be interpreted as an integer
self._traceback = tf_stack.extract_stack()
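
According to the traceback in the log below, the TypeError is raised inside np.linspace, which passes its sample count to operator.index, and pycocotools computes that count as np.round((0.95 - .5) / .05) + 1, which is a float. The following is a minimal standard-library sketch of why that fails and how casting to int avoids it. The helper whole_number_count is hypothetical (not part of NumPy or pycocotools), and float(round(...)) stands in for np.round here.

```python
import operator


def whole_number_count(value):
    """Convert a float that holds a whole number (e.g. 10.0) into an int count."""
    as_int = int(round(value))
    if as_int != value:
        raise ValueError("not a whole number: %r" % (value,))
    return as_int


# Same arithmetic as the pycocotools line; the result is a float, not an int.
num = float(round((0.95 - 0.5) / 0.05)) + 1

try:
    operator.index(num)  # mirrors the call that fails inside linspace
except TypeError as err:
    print("rejected:", err)

print(operator.index(whole_number_count(num)))  # 10
```

In the pycocotools line itself, the analogous change would be to wrap the count in int(...). Whether the installed pycocotools already does this, or whether an older NumPy that still accepts a float count is preferable, depends on the environment and is worth checking.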
My complete training log is as follows:
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/training/evaluation.py", line 272, in _evaluate_once
session.run(eval_ops, feed_dict)
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/training/monitored_session.py", line 754, in run
run_metadata=run_metadata)
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/training/monitored_session.py", line 1259, in run
run_metadata=run_metadata)
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/training/monitored_session.py", line 1360, in run
raise six.reraise(*original_exc_info)
File "/usr/local/lib/python3.6/dist-packages/six.py", line 693, in reraise
raise value
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/training/monitored_session.py", line 1345, in run
return self._sess.run(*args, **kwargs)
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/training/monitored_session.py", line 1418, in run
run_metadata=run_metadata)
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/training/monitored_session.py", line 1176, in run
return self._sess.run(*args, **kwargs)
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/client/session.py", line 956, in run
run_metadata_ptr)
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/client/session.py", line 1180, in _run
feed_dict_tensor, options, run_metadata)
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/client/session.py", line 1359, in _do_run
run_metadata)
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/client/session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.OutOfRangeError: 2 root error(s) found.
(0) Out of range: End of sequence
[[node IteratorGetNext (defined at tensorflow-1.15.2/python3.6/tensorflow_core/python/framework/ops.py:1748) ]]
(1) Out of range: End of sequence
[[node IteratorGetNext (defined at tensorflow-1.15.2/python3.6/tensorflow_core/python/framework/ops.py:1748) ]]
[[Postprocessor/BatchMultiClassNonMaxSuppression/MultiClassNonMaxSuppression/non_max_suppression_with_scores_1/NonMaxSuppressionV5/_4683]]
0 successful operations.
0 derived errors ignored.
Original stack trace for 'IteratorGetNext':
File "content/models/research/object_detection/model_main.py", line 114, in <module>
tf.app.run()
File "tensorflow-1.15.2/python3.6/tensorflow_core/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "usr/local/lib/python3.6/dist-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "usr/local/lib/python3.6/dist-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "content/models/research/object_detection/model_main.py", line 110, in main
tf.estimator.train_and_evaluate(estimator, train_spec, eval_specs[0])
File "tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/training.py", line 473, in train_and_evaluate
return executor.run()
File "tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/training.py", line 613, in run
return self.run_local()
File "tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/training.py", line 714, in run_local
saving_listeners=saving_listeners)
File "tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/estimator.py", line 370, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/estimator.py", line 1161, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/estimator.py", line 1195, in _train_model_default
saving_listeners)
File "tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/estimator.py", line 1494, in _train_with_estimator_spec
_, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
File "tensorflow-1.15.2/python3.6/tensorflow_core/python/training/monitored_session.py", line 754, in run
run_metadata=run_metadata)
File "tensorflow-1.15.2/python3.6/tensorflow_core/python/training/monitored_session.py", line 1259, in run
run_metadata=run_metadata)
File "tensorflow-1.15.2/python3.6/tensorflow_core/python/training/monitored_session.py", line 1345, in run
return self._sess.run(*args, **kwargs)
File "tensorflow-1.15.2/python3.6/tensorflow_core/python/training/monitored_session.py", line 1426, in run
run_metadata=run_metadata))
File "tensorflow-1.15.2/python3.6/tensorflow_core/python/training/basic_session_run_hooks.py", line 594, in after_run
if self._save(run_context.session, global_step):
File "tensorflow-1.15.2/python3.6/tensorflow_core/python/training/basic_session_run_hooks.py", line 619, in _save
if l.after_save(session, step):
File "tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/training.py", line 519, in after_save
self._evaluate(global_step_value) # updates self.eval_result
File "tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/training.py", line 539, in _evaluate
self._evaluator.evaluate_and_export())
File "tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/training.py", line 920, in evaluate_and_export
hooks=self._eval_spec.hooks)
File "tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/estimator.py", line 480, in evaluate
name=name)
File "tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/estimator.py", line 522, in _actual_eval
return _evaluate()
File "tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/estimator.py", line 504, in _evaluate
self._evaluate_build_graph(input_fn, hooks, checkpoint_path))
File "tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/estimator.py", line 1511, in _evaluate_build_graph
self._call_model_fn_eval(input_fn, self.config))
File "tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/estimator.py", line 1544, in _call_model_fn_eval
input_fn, ModeKeys.EVAL)
File "tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/estimator.py", line 1025, in _get_features_and_labels_from_input_fn
self._call_input_fn(input_fn, mode))
File "tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/util.py", line 65, in parse_input_fn_result
result = iterator.get_next()
File "tensorflow-1.15.2/python3.6/tensorflow_core/python/data/ops/iterator_ops.py", line 426, in get_next
name=name)
File "tensorflow-1.15.2/python3.6/tensorflow_core/python/ops/gen_dataset_ops.py", line 2518, in iterator_get_next
output_shapes=output_shapes, name=name)
File "tensorflow-1.15.2/python3.6/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
op_def=op_def)
File "tensorflow-1.15.2/python3.6/tensorflow_core/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "tensorflow-1.15.2/python3.6/tensorflow_core/python/framework/ops.py", line 3357, in create_op
attrs, op_def, compute_device)
File "tensorflow-1.15.2/python3.6/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
op_def=op_def)
File "tensorflow-1.15.2/python3.6/tensorflow_core/python/framework/ops.py", line 1748, in __init__
self._traceback = tf_stack.extract_stack()
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/client/session.py", line 1365, in _do_call
return fn(*args)
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/client/session.py", line 1350, in _run_fn
target_list, run_metadata)
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: TypeError: object of type <class 'numpy.float64'> cannot be safely interpreted as an integer.
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/numpy/core/function_base.py", line 117, in linspace
num = operator.index(num)
TypeError: 'numpy.float64' object cannot be interpreted as an integer
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/ops/script_ops.py", line 235, in __call__
ret = func(*args)
File "/content/models/research/object_detection/metrics/coco_evaluation.py", line 416, in first_value_func
self._metrics = self.evaluate()
File "/content/models/research/object_detection/metrics/coco_evaluation.py", line 247, in evaluate
coco_wrapped_groundtruth, coco_wrapped_detections, agnostic_mode=False)
File "/content/models/research/object_detection/metrics/coco_tools.py", line 178, in __init__
cocoeval.COCOeval.__init__(self, groundtruth, detections, iouType=iou_type)
File "/usr/local/lib/python3.6/dist-packages/pycocotools/cocoeval.py", line 76, in __init__
self.params = Params(iouType=iouType) # parameters
File "/usr/local/lib/python3.6/dist-packages/pycocotools/cocoeval.py", line 527, in __init__
self.setDetParams()
File "/usr/local/lib/python3.6/dist-packages/pycocotools/cocoeval.py", line 507, in setDetParams
self.iouThrs = np.linspace(.5, 0.95, np.round((0.95 - .5) / .05) + 1, endpoint=True)
File "<__array_function__ internals>", line 6, in linspace
File "/usr/local/lib/python3.6/dist-packages/numpy/core/function_base.py", line 121, in linspace
.format(type(num)))
TypeError: object of type <class 'numpy.float64'> cannot be safely interpreted as an integer.
[[{{node PyFunc_3}}]]
[[cond/Detections_Left_Groundtruth_Right/0/_4927]]
(1) Invalid argument: TypeError: object of type <class 'numpy.float64'> cannot be safely interpreted as an integer.
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/numpy/core/function_base.py", line 117, in linspace
num = operator.index(num)
TypeError: 'numpy.float64' object cannot be interpreted as an integer
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/ops/script_ops.py", line 235, in __call__
ret = func(*args)
File "/content/models/research/object_detection/metrics/coco_evaluation.py", line 416, in first_value_func
self._metrics = self.evaluate()
File "/content/models/research/object_detection/metrics/coco_evaluation.py", line 247, in evaluate
coco_wrapped_groundtruth, coco_wrapped_detections, agnostic_mode=False)
File "/content/models/research/object_detection/metrics/coco_tools.py", line 178, in __init__
cocoeval.COCOeval.__init__(self, groundtruth, detections, iouType=iou_type)
File "/usr/local/lib/python3.6/dist-packages/pycocotools/cocoeval.py", line 76, in __init__
self.params = Params(iouType=iouType) # parameters
File "/usr/local/lib/python3.6/dist-packages/pycocotools/cocoeval.py", line 527, in __init__
self.setDetParams()
File "/usr/local/lib/python3.6/dist-packages/pycocotools/cocoeval.py", line 507, in setDetParams
self.iouThrs = np.linspace(.5, 0.95, np.round((0.95 - .5) / .05) + 1, endpoint=True)
File "<__array_function__ internals>", line 6, in linspace
File "/usr/local/lib/python3.6/dist-packages/numpy/core/function_base.py", line 121, in linspace
.format(type(num)))
TypeError: object of type <class 'numpy.float64'> cannot be safely interpreted as an integer.
[[{{node PyFunc_3}}]]
0 successful operations.
0 derived errors ignored.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/content/models/research/object_detection/model_main.py", line 114, in <module>
tf.app.run()
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "/usr/local/lib/python3.6/dist-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "/content/models/research/object_detection/model_main.py", line 110, in main
tf.estimator.train_and_evaluate(estimator, train_spec, eval_specs[0])
File "/tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/training.py", line 473, in train_and_evaluate
return executor.run()
File "/tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/training.py", line 613, in run
return self.run_local()
File "/tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/training.py", line 714, in run_local
saving_listeners=saving_listeners)
File "/tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/estimator.py", line 370, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/estimator.py", line 1161, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/estimator.py", line 1195, in _train_model_default
saving_listeners)
File "/tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/estimator.py", line 1494, in _train_with_estimator_spec
_, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/training/monitored_session.py", line 754, in run
run_metadata=run_metadata)
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/training/monitored_session.py", line 1259, in run
run_metadata=run_metadata)
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/training/monitored_session.py", line 1360, in run
raise six.reraise(*original_exc_info)
File "/usr/local/lib/python3.6/dist-packages/six.py", line 693, in reraise
raise value
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/training/monitored_session.py", line 1345, in run
return self._sess.run(*args, **kwargs)
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/training/monitored_session.py", line 1426, in run
run_metadata=run_metadata))
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/training/basic_session_run_hooks.py", line 594, in after_run
if self._save(run_context.session, global_step):
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/training/basic_session_run_hooks.py", line 619, in _save
if l.after_save(session, step):
File "/tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/training.py", line 519, in after_save
self._evaluate(global_step_value) # updates self.eval_result
File "/tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/training.py", line 539, in _evaluate
self._evaluator.evaluate_and_export())
File "/tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/training.py", line 920, in evaluate_and_export
hooks=self._eval_spec.hooks)
File "/tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/estimator.py", line 480, in evaluate
name=name)
File "/tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/estimator.py", line 522, in _actual_eval
return _evaluate()
File "/tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/estimator.py", line 511, in _evaluate
output_dir=self.eval_dir(name))
File "/tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/estimator.py", line 1619, in _evaluate_run
config=self._session_config)
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/training/evaluation.py", line 272, in _evaluate_once
session.run(eval_ops, feed_dict)
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/training/monitored_session.py", line 861, in __exit__
self._close_internal(exception_type)
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/training/monitored_session.py", line 894, in _close_internal
h.end(self._coordinated_creator.tf_sess)
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/training/basic_session_run_hooks.py", line 951, in end
self._final_ops, feed_dict=self._final_ops_feed_dict)
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/client/session.py", line 956, in run
run_metadata_ptr)
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/client/session.py", line 1180, in _run
feed_dict_tensor, options, run_metadata)
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/client/session.py", line 1359, in _do_run
run_metadata)
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/client/session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: TypeError: object of type <class 'numpy.float64'> cannot be safely interpreted as an integer.
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/numpy/core/function_base.py", line 117, in linspace
num = operator.index(num)
TypeError: 'numpy.float64' object cannot be interpreted as an integer
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/ops/script_ops.py", line 235, in __call__
ret = func(*args)
File "/content/models/research/object_detection/metrics/coco_evaluation.py", line 416, in first_value_func
self._metrics = self.evaluate()
File "/content/models/research/object_detection/metrics/coco_evaluation.py", line 247, in evaluate
coco_wrapped_groundtruth, coco_wrapped_detections, agnostic_mode=False)
File "/content/models/research/object_detection/metrics/coco_tools.py", line 178, in __init__
cocoeval.COCOeval.__init__(self, groundtruth, detections, iouType=iou_type)
File "/usr/local/lib/python3.6/dist-packages/pycocotools/cocoeval.py", line 76, in __init__
self.params = Params(iouType=iouType) # parameters
File "/usr/local/lib/python3.6/dist-packages/pycocotools/cocoeval.py", line 527, in __init__
self.setDetParams()
File "/usr/local/lib/python3.6/dist-packages/pycocotools/cocoeval.py", line 507, in setDetParams
self.iouThrs = np.linspace(.5, 0.95, np.round((0.95 - .5) / .05) + 1, endpoint=True)
File "<__array_function__ internals>", line 6, in linspace
File "/usr/local/lib/python3.6/dist-packages/numpy/core/function_base.py", line 121, in linspace
.format(type(num)))
TypeError: object of type <class 'numpy.float64'> cannot be safely interpreted as an integer.
[[node PyFunc_3 (defined at tensorflow-1.15.2/python3.6/tensorflow_core/python/framework/ops.py:1748) ]]
[[cond/Detections_Left_Groundtruth_Right/0/_4927]]
(1) Invalid argument: TypeError: object of type <class 'numpy.float64'> cannot be safely interpreted as an integer.
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/numpy/core/function_base.py", line 117, in linspace
num = operator.index(num)
TypeError: 'numpy.float64' object cannot be interpreted as an integer
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/tensorflow-1.15.2/python3.6/tensorflow_core/python/ops/script_ops.py", line 235, in __call__
ret = func(*args)
File "/content/models/research/object_detection/metrics/coco_evaluation.py", line 416, in first_value_func
self._metrics = self.evaluate()
File "/content/models/research/object_detection/metrics/coco_evaluation.py", line 247, in evaluate
coco_wrapped_groundtruth, coco_wrapped_detections, agnostic_mode=False)
File "/content/models/research/object_detection/metrics/coco_tools.py", line 178, in __init__
cocoeval.COCOeval.__init__(self, groundtruth, detections, iouType=iou_type)
File "/usr/local/lib/python3.6/dist-packages/pycocotools/cocoeval.py", line 76, in __init__
self.params = Params(iouType=iouType) # parameters
File "/usr/local/lib/python3.6/dist-packages/pycocotools/cocoeval.py", line 527, in __init__
self.setDetParams()
File "/usr/local/lib/python3.6/dist-packages/pycocotools/cocoeval.py", line 507, in setDetParams
self.iouThrs = np.linspace(.5, 0.95, np.round((0.95 - .5) / .05) + 1, endpoint=True)
File "<__array_function__ internals>", line 6, in linspace
File "/usr/local/lib/python3.6/dist-packages/numpy/core/function_base.py", line 121, in linspace
.format(type(num)))
TypeError: object of type <class 'numpy.float64'> cannot be safely interpreted as an integer.
[[node PyFunc_3 (defined at tensorflow-1.15.2/python3.6/tensorflow_core/python/framework/ops.py:1748) ]]
0 successful operations.
0 derived errors ignored.
Original stack trace for 'PyFunc_3':
File "content/models/research/object_detection/model_main.py", line 114, in <module>
tf.app.run()
File "tensorflow-1.15.2/python3.6/tensorflow_core/python/platform/app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "usr/local/lib/python3.6/dist-packages/absl/app.py", line 299, in run
_run_main(main, args)
File "usr/local/lib/python3.6/dist-packages/absl/app.py", line 250, in _run_main
sys.exit(main(argv))
File "content/models/research/object_detection/model_main.py", line 110, in main
tf.estimator.train_and_evaluate(estimator, train_spec, eval_specs[0])
File "tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/training.py", line 473, in train_and_evaluate
return executor.run()
File "tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/training.py", line 613, in run
return self.run_local()
File "tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/training.py", line 714, in run_local
saving_listeners=saving_listeners)
File "tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/estimator.py", line 370, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/estimator.py", line 1161, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/estimator.py", line 1195, in _train_model_default
saving_listeners)
File "tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/estimator.py", line 1494, in _train_with_estimator_spec
_, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
File "tensorflow-1.15.2/python3.6/tensorflow_core/python/training/monitored_session.py", line 754, in run
run_metadata=run_metadata)
File "tensorflow-1.15.2/python3.6/tensorflow_core/python/training/monitored_session.py", line 1259, in run
run_metadata=run_metadata)
File "tensorflow-1.15.2/python3.6/tensorflow_core/python/training/monitored_session.py", line 1345, in run
return self._sess.run(*args, **kwargs)
File "tensorflow-1.15.2/python3.6/tensorflow_core/python/training/monitored_session.py", line 1426, in run
run_metadata=run_metadata))
File "tensorflow-1.15.2/python3.6/tensorflow_core/python/training/basic_session_run_hooks.py", line 594, in after_run
if self._save(run_context.session, global_step):
File "tensorflow-1.15.2/python3.6/tensorflow_core/python/training/basic_session_run_hooks.py", line 619, in _save
if l.after_save(session, step):
File "tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/training.py", line 519, in after_save
self._evaluate(global_step_value) # updates self.eval_result
File "tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/training.py", line 539, in _evaluate
self._evaluator.evaluate_and_export())
File "tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/training.py", line 920, in evaluate_and_export
hooks=self._eval_spec.hooks)
File "tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/estimator.py", line 480, in evaluate
name=name)
File "tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/estimator.py", line 522, in _actual_eval
return _evaluate()
File "tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/estimator.py", line 504, in _evaluate
self._evaluate_build_graph(input_fn, hooks, checkpoint_path))
File "tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/estimator.py", line 1511, in _evaluate_build_graph
self._call_model_fn_eval(input_fn, self.config))
File "tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/estimator.py", line 1547, in _call_model_fn_eval
features, labels, ModeKeys.EVAL, config)
File "tensorflow-1.15.2/python3.6/tensorflow_estimator/python/estimator/estimator.py", line 1149, in _call_model_fn
model_fn_results = self._model_fn(features=features, **kwargs)
File "content/models/research/object_detection/model_lib.py", line 570, in model_fn
eval_config, list(category_index.values()), eval_dict)
File "content/models/research/object_detection/eval_util.py", line 1045, in get_eval_metric_ops_for_evaluators
eval_dict))
File "content/models/research/object_detection/metrics/coco_evaluation.py", line 426, in get_estimator_eval_metric_ops
first_value_op = tf.py_func(first_value_func, [], tf.float32)
File "tensorflow-1.15.2/python3.6/tensorflow_core/python/util/deprecation.py", line 324, in new_func
return func(*args, **kwargs)
File "tensorflow-1.15.2/python3.6/tensorflow_core/python/ops/script_ops.py", line 513, in py_func
return py_func_common(func, inp, Tout, stateful, name=name)
File "tensorflow-1.15.2/python3.6/tensorflow_core/python/ops/script_ops.py", line 495, in py_func_common
func=func, inp=inp, Tout=Tout, stateful=stateful, eager=False, name=name)
File "tensorflow-1.15.2/python3.6/tensorflow_core/python/ops/script_ops.py", line 318, in _internal_py_func
input=inp, token=token, Tout=Tout, name=name)
File "tensorflow-1.15.2/python3.6/tensorflow_core/python/ops/gen_script_ops.py", line 170, in py_func
"PyFunc", input=input, token=token, Tout=Tout, name=name)
File "tensorflow-1.15.2/python3.6/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
op_def=op_def)
File "tensorflow-1.15.2/python3.6/tensorflow_core/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "tensorflow-1.15.2/python3.6/tensorflow_core/python/framework/ops.py", line 3357, in create_op
attrs, op_def, compute_device)
File "tensorflow-1.15.2/python3.6/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
op_def=op_def)
File "tensorflow-1.15.2/python3.6/tensorflow_core/python/framework/ops.py", line 1748, in __init__
self._traceback = tf_stack.extract_stack()
What are the possibilities for getting past this issue? Any recommendations?

Try adding these lines of code immediately after importing TensorFlow in your train.py:
from tensorflow.compat.v1 import ConfigProto
from tensorflow.compat.v1 import InteractiveSession
config = ConfigProto()
config.gpu_options.allow_growth = True
session = InteractiveSession(config=config)
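Independently of GPU memory settings, the traceback in the question ends with np.linspace rejecting a numpy.float64 as its num argument (pycocotools cocoeval.py, line 507), right after operator.index(num) fails. The sketch below reproduces that failure mode with the standard library only. A plain Python float stands in for numpy.float64, and the helper name as_count is hypothetical. Converting the count with int() before it reaches linspace is one possible workaround, offered here as an assumption rather than as the fix the library adopted.

```python
import operator

# Same arithmetic as in the traceback. A plain float stands in for the
# numpy.float64 that np.round returns (assumption, for illustration only).
num_as_float = float(round((0.95 - 0.5) / 0.05)) + 1


def as_count(value):
    """Return value as an integer count, or None if operator.index rejects it."""
    try:
        return operator.index(value)
    except TypeError:
        return None


rejected = as_count(num_as_float)       # a float is refused, like in the traceback
accepted = as_count(int(num_as_float))  # an explicit int conversion is accepted

print(rejected, accepted)
```

If this is the cause, the options are to make the count an integer in the calling code or to use a NumPy release that still accepts a float there. Which one fits depends on the environment.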

Restoring saved model in Tensorflow v1.14

I am using tensorflow v1.14. I have a saved model and I'm trying to restore the model using the following code:
loader = tf.train.import_meta_graph("models/fcnn0/model.ckpt.meta")
graph = tf.get_default_graph()
sess = tf.Session()
loader.restore(sess, "models/fcnn0/model.ckpt")
The same piece of code worked without errors in Tensorflow v1.13, but now I'm getting this error:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1356, in _do_call
return fn(*args)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1341, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1429, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.DataLossError: file is too short to be an sstable
[[{{node save/RestoreV2}}]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/sandesh/PycharmProjects/fading/finding_code/src/load_32_64.py", line 8, in <module>
loader.restore(sess, "models/fcnn_32_64_aenc_1331_747_3870000/model.ckpt")
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 1286, in restore
{self.saver_def.filename_tensor_name: save_path})
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 950, in run
run_metadata_ptr)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1173, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1350, in _do_run
run_metadata)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1370, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.DataLossError: file is too short to be an sstable
[[node save/RestoreV2 (defined at home/sandesh/PycharmProjects/fading/finding_code/src/load_32_64.py:5) ]]
Original stack trace for 'save/RestoreV2':
File "home/sandesh/PycharmProjects/fading/finding_code/src/load_32_64.py", line 5, in <module>
loader = tf.train.import_meta_graph("models/fcnn_32_64_aenc0/model.ckpt.meta")
File "usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 1449, in import_meta_graph
**kwargs)[0]
File "usr/local/lib/python3.6/dist-packages/tensorflow/python/training/saver.py", line 1473, in _import_meta_graph_with_return_elements
**kwargs))
File "usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/meta_graph.py", line 857, in import_scoped_meta_graph_with_return_elements
return_elements=return_elements)
File "usr/local/lib/python3.6/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/importer.py", line 443, in import_graph_def
_ProcessNewOps(graph)
File "usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/importer.py", line 236, in _ProcessNewOps
for new_op in graph._add_new_tf_operations(compute_devices=False): # pylint: disable=protected-access
File "usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 3751, in _add_new_tf_operations
for c_op in c_api_util.new_tf_operations(self)
File "usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 3751, in <listcomp>
for c_op in c_api_util.new_tf_operations(self)
File "usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 3641, in _create_op_from_tf_operation
ret = Operation(c_op, self)
File "usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 2005, in __init__
self._traceback = tf_stack.extract_stack()
Can someone point me as to what I'm doing wrong? Thanks in advance.

I was looking into the folder where the model files were saved and found that the model.ckpt.meta file had not been written to disk properly. I reran the training and saved the model and then it worked perfectly.
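A quick pre-flight check along these lines can catch a missing or empty checkpoint file before calling restore. This is a minimal sketch, assuming the usual TensorFlow 1.x layout of model.ckpt.meta, model.ckpt.index and one or more model.ckpt.data-* shards. The function name incomplete_checkpoint_files is hypothetical. It only detects missing or zero-length files, so a file that was truncated part-way through would still pass.

```python
import glob
import os


def incomplete_checkpoint_files(prefix):
    """Return the checkpoint files for `prefix` that are missing or empty."""
    problems = []
    # The graph definition and the tensor index are single files.
    for path in (prefix + ".meta", prefix + ".index"):
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            problems.append(path)
    # Variable values live in one or more data shards.
    shards = glob.glob(glob.escape(prefix) + ".data-*")
    if not shards:
        problems.append(prefix + ".data-*")
    problems.extend(p for p in shards if os.path.getsize(p) == 0)
    return problems


print(incomplete_checkpoint_files("models/fcnn0/model.ckpt"))
```

An empty list means every expected file exists and has some content. Anything else lists the paths worth inspecting before retraining.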

Do I have to restart colaboratory runtime every time?

I cannot run my code in Google Colaboratory twice without restarting the runtime. Is there a way to run it without restarting the runtime?
My code uses the TensorD library to compute an approximation of a random 3x4x4 tensor using the CP ALS algorithm (this example is taken from https://github.com/Large-Scale-Tensor-Decomposition/tensorD):
!git clone https://github.com/Large-Scale-Tensor-Decomposition/tensorD.git
import sys
import time
sys.path.append("/content/tensorD")
from tensorD.factorization.env import Environment
from tensorD.dataproc.provider import Provider
from tensorD.demo.DataGenerator import *
from tensorD.factorization.cp import CP_ALS
# generate a random tensor with shape 3x4x4
t = time.time()
X = synthetic_data_cp([3, 4, 4], 7)
data_provider = Provider()
data_provider.full_tensor = lambda: X
env = Environment(data_provider, summary_path='/tmp/cp_' + '7')
cp = CP_ALS(env)
args = CP_ALS.CP_Args(rank=7, validation_internal=1)
# build CP model with arguments
cp.build_model(args)
# train CP model with the maximum iteration of 50
cp.train(50)
# obtain factor matrices from trained model
factor_matrices = cp.factors
# obtain scaling vector from trained model
lambdas = cp.lambdas
for matrix in factor_matrices:
    print(matrix)
elapsed = time.time() - t
print(elapsed)
When I run it the first time I have no problem. When I run it again (without restarting the runtime) I get:
CP model initial finish
---------------------------------------------------------------------------
InvalidArgumentError Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
1333 try:
-> 1334 return fn(*args)
1335 except errors.OpError as e:
7 frames
InvalidArgumentError: You must feed a value for placeholder tensor 'Placeholder' with dtype float and shape [3,4,4]
[[{{node Placeholder}}]]
During handling of the above exception, another exception occurred:
InvalidArgumentError Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
1346 pass
1347 message = error_interpolation.interpolate(message, self._graph)
-> 1348 raise type(e)(node_def, op, message)
1349
1350 def _extend_graph(self):
InvalidArgumentError: You must feed a value for placeholder tensor 'Placeholder' with dtype float and shape [3,4,4]
[[node Placeholder (defined at /content/tensorD/tensorD/factorization/cp.py:69) ]]
Caused by op 'Placeholder', defined at:
File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py", line 16, in <module>
app.launch_new_instance()
File "/usr/local/lib/python3.6/dist-packages/traitlets/config/application.py", line 658, in launch_instance
app.start()
File "/usr/local/lib/python3.6/dist-packages/ipykernel/kernelapp.py", line 477, in start
ioloop.IOLoop.instance().start()
File "/usr/local/lib/python3.6/dist-packages/tornado/ioloop.py", line 888, in start
handler_func(fd_obj, events)
File "/usr/local/lib/python3.6/dist-packages/tornado/stack_context.py", line 277, in null_wrapper
return fn(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/zmq/eventloop/zmqstream.py", line 450, in _handle_events
self._handle_recv()
File "/usr/local/lib/python3.6/dist-packages/zmq/eventloop/zmqstream.py", line 480, in _handle_recv
self._run_callback(callback, msg)
File "/usr/local/lib/python3.6/dist-packages/zmq/eventloop/zmqstream.py", line 432, in _run_callback
callback(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tornado/stack_context.py", line 277, in null_wrapper
return fn(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/ipykernel/kernelbase.py", line 283, in dispatcher
return self.dispatch_shell(stream, msg)
File "/usr/local/lib/python3.6/dist-packages/ipykernel/kernelbase.py", line 235, in dispatch_shell
handler(stream, idents, msg)
File "/usr/local/lib/python3.6/dist-packages/ipykernel/kernelbase.py", line 399, in execute_request
user_expressions, allow_stdin)
File "/usr/local/lib/python3.6/dist-packages/ipykernel/ipkernel.py", line 196, in do_execute
res = shell.run_cell(code, store_history=store_history, silent=silent)
File "/usr/local/lib/python3.6/dist-packages/ipykernel/zmqshell.py", line 533, in run_cell
return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py", line 2718, in run_cell
interactivity=interactivity, compiler=compiler, result=result)
File "/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py", line 2822, in run_ast_nodes
if self.run_code(code, result):
File "/usr/local/lib/python3.6/dist-packages/IPython/core/interactiveshell.py", line 2882, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-4-4d2250ec007b>", line 21, in <module>
cp.build_model(args)
File "/content/tensorD/tensorD/factorization/cp.py", line 69, in build_model
input_data = tf.placeholder(tf.float32, shape=self._env.full_shape())
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/array_ops.py", line 2077, in placeholder
return gen_array_ops.placeholder(dtype=dtype, shape=shape, name=name)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gen_array_ops.py", line 5791, in placeholder
"Placeholder", dtype=dtype, shape=shape, name=name)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
op_def=op_def)
File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
self._traceback = tf_stack.extract_stack()
InvalidArgumentError (see above for traceback): You must feed a value for placeholder tensor 'Placeholder' with dtype float and shape [3,4,4]
[[node Placeholder (defined at /content/tensorD/tensorD/factorization/cp.py:69) ]]
Any help will be appreciated!

This seems to mostly be a question about TensorFlow. Do you get what you want outside of Colab? I'm not totally clear about what you are expecting, but import tensorflow as tf; tf.reset_default_graph() at the top of your snippet seems sensible and squelches the error.

Tensorflow: sess.run(x) not working. InvalidArgumentError: Cannot assign a device for operation 'MatMul': Operation was assigned to /device:GPU:1

I'm using Python 3.6 (Anaconda) on a 64-bit Windows PC with TensorFlow 1.2.1. I'm running the following simple code:
import tensorflow as tf
sess = tf.Session()
x1 = tf.constant(5)
x2 = tf.constant(6)
# runs result
print(sess.run(x1))
It gives me the following error:
Traceback (most recent call last):
File "<ipython-input-64-f7e8ea564f81>", line 7, in <module>
print(sess.run(x1))
File "C:\Users\POPEYE.SAILOR\AppData\Local\Continuum\Anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 789, in run
run_metadata_ptr)
File "C:\Users\POPEYE.SAILOR\AppData\Local\Continuum\Anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 997, in _run
feed_dict_string, options, run_metadata)
File "C:\Users\POPEYE.SAILOR\AppData\Local\Continuum\Anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 1132, in _do_run
target_list, options, run_metadata)
File "C:\Users\POPEYE.SAILOR\AppData\Local\Continuum\Anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 1152, in _do_call
raise type(e)(node_def, op, message)
InvalidArgumentError: Cannot assign a device for operation 'MatMul': Operation was explicitly assigned to /device:GPU:1 but available devices are [ /job:localhost/replica:0/task:0/cpu:0 ]. Make sure the device specification refers to a valid device.
[[Node: MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/device:GPU:1"](Const_2, Const_3)]]
Caused by op 'MatMul', defined at:
File "C:\Users\POPEYE.SAILOR\AppData\Local\Continuum\Anaconda3\lib\site-packages\spyder\utils\ipython\start_kernel.py", line 227, in <module>
main()
File "C:\Users\POPEYE.SAILOR\AppData\Local\Continuum\Anaconda3\lib\site-packages\spyder\utils\ipython\start_kernel.py", line 223, in main
kernel.start()
File "C:\Users\POPEYE.SAILOR\AppData\Local\Continuum\Anaconda3\lib\site-packages\ipykernel\kernelapp.py", line 474, in start
ioloop.IOLoop.instance().start()
File "C:\Users\POPEYE.SAILOR\AppData\Local\Continuum\Anaconda3\lib\site-packages\zmq\eventloop\ioloop.py", line 177, in start
super(ZMQIOLoop, self).start()
File "C:\Users\POPEYE.SAILOR\AppData\Local\Continuum\Anaconda3\lib\site-packages\tornado\ioloop.py", line 887, in start
handler_func(fd_obj, events)
File "C:\Users\POPEYE.SAILOR\AppData\Local\Continuum\Anaconda3\lib\site-packages\tornado\stack_context.py", line 275, in null_wrapper
return fn(*args, **kwargs)
File "C:\Users\POPEYE.SAILOR\AppData\Local\Continuum\Anaconda3\lib\site-packages\zmq\eventloop\zmqstream.py", line 440, in _handle_events
self._handle_recv()
File "C:\Users\POPEYE.SAILOR\AppData\Local\Continuum\Anaconda3\lib\site-packages\zmq\eventloop\zmqstream.py", line 472, in _handle_recv
self._run_callback(callback, msg)
File "C:\Users\POPEYE.SAILOR\AppData\Local\Continuum\Anaconda3\lib\site-packages\zmq\eventloop\zmqstream.py", line 414, in _run_callback
callback(*args, **kwargs)
File "C:\Users\POPEYE.SAILOR\AppData\Local\Continuum\Anaconda3\lib\site-packages\tornado\stack_context.py", line 275, in null_wrapper
return fn(*args, **kwargs)
File "C:\Users\POPEYE.SAILOR\AppData\Local\Continuum\Anaconda3\lib\site-packages\ipykernel\kernelbase.py", line 276, in dispatcher
return self.dispatch_shell(stream, msg)
File "C:\Users\POPEYE.SAILOR\AppData\Local\Continuum\Anaconda3\lib\site-packages\ipykernel\kernelbase.py", line 228, in dispatch_shell
handler(stream, idents, msg)
File "C:\Users\POPEYE.SAILOR\AppData\Local\Continuum\Anaconda3\lib\site-packages\ipykernel\kernelbase.py", line 390, in execute_request
user_expressions, allow_stdin)
File "C:\Users\POPEYE.SAILOR\AppData\Local\Continuum\Anaconda3\lib\site-packages\ipykernel\ipkernel.py", line 196, in do_execute
res = shell.run_cell(code, store_history=store_history, silent=silent)
File "C:\Users\POPEYE.SAILOR\AppData\Local\Continuum\Anaconda3\lib\site-packages\ipykernel\zmqshell.py", line 501, in run_cell
return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
File "C:\Users\POPEYE.SAILOR\AppData\Local\Continuum\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2717, in run_cell
interactivity=interactivity, compiler=compiler, result=result)
File "C:\Users\POPEYE.SAILOR\AppData\Local\Continuum\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2821, in run_ast_nodes
if self.run_code(code, result):
File "C:\Users\POPEYE.SAILOR\AppData\Local\Continuum\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2881, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-18-02c5e13ac58a>", line 5, in <module>
product = tf.matmul(matrix1, matrix2)
File "C:\Users\POPEYE.SAILOR\AppData\Local\Continuum\Anaconda3\lib\site-packages\tensorflow\python\ops\math_ops.py", line 1816, in matmul
a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
File "C:\Users\POPEYE.SAILOR\AppData\Local\Continuum\Anaconda3\lib\site-packages\tensorflow\python\ops\gen_math_ops.py", line 1217, in _mat_mul
transpose_b=transpose_b, name=name)
File "C:\Users\POPEYE.SAILOR\AppData\Local\Continuum\Anaconda3\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 767, in apply_op
op_def=op_def)
File "C:\Users\POPEYE.SAILOR\AppData\Local\Continuum\Anaconda3\lib\site-packages\tensorflow\python\framework\ops.py", line 2506, in create_op
original_op=self._default_original_op, op_def=op_def)
File "C:\Users\POPEYE.SAILOR\AppData\Local\Continuum\Anaconda3\lib\site-packages\tensorflow\python\framework\ops.py", line 1269, in __init__
self._traceback = _extract_stack()
InvalidArgumentError (see above for traceback): Cannot assign a device for operation 'MatMul': Operation was explicitly assigned to /device:GPU:1 but available devices are [ /job:localhost/replica:0/task:0/cpu:0 ]. Make sure the device specification refers to a valid device.
[[Node: MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/device:GPU:1"](Const_2, Const_3)]]
Prior to this it was running just fine. I could run this code before, but it has suddenly started showing the above error. I have not made any changes to the Anaconda environment, nor have I installed any other package.

ResourceException when use GPU but not CPU on Azure - azure

Host memory is quite a bit larger than device memory on an N-series machine. Are you sure you aren't simply exceeding the device capacity?
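To put a number on that, the size of the tensor named in the error message can be estimated directly. A back-of-the-envelope sketch, assuming the shape [78298,300] and DT_FLOAT (4 bytes per element) quoted in the question, and ignoring any extra copy held by the initializer:

```python
# Shape and dtype come from the error message: shape[78298,300], DT_FLOAT.
rows, cols = 78298, 300
bytes_per_float32 = 4

tensor_bytes = rows * cols * bytes_per_float32
tensor_mib = tensor_bytes / 2**20

print(f"{tensor_bytes} bytes = {tensor_mib:.1f} MiB")
```

That works out to a little under 90 MiB for the embedding matrix alone. If the device has far more memory than that, it may be worth checking what else already occupies it, such as another process or an earlier session. That is a guess, not something the error message confirms.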
