TensorFlow object detection error when training
Hi guys!
I am trying to run the cat example locally and I got stuck at the training step: I get the very long error shown below. Could someone help me understand what is wrong?
Thanks in advance.
The command:
bertalan#mbqs:~/tensorflow/models$ python object_detection/train.py --logtostderr --pipeline_config_path=/home/bertalan/tensorflow/models/object_detection/samples/configs/Myfaster_rcnn_resnet101_pets.config --train_dir=TrainCat
Here is the output with the error:
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Summary name Learning Rate is illegal; using Learning_Rate instead.
INFO:tensorflow:Summary name /clone_loss is illegal; using clone_loss instead.
/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gradients_impl.py:93: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
"Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
2017-07-26 11:27:18.416172: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-26 11:27:18.416220: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-26 11:27:18.416248: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-07-26 11:27:20.437921: I tensorflow/core/common_runtime/simple_placer.cc:669] Ignoring device specification /device:GPU:0 for node 'prefetch_queue_Dequeue' because the input edge from 'prefetch_queue' is a reference connection and already has a device field set to /device:CPU:0
INFO:tensorflow:Restoring parameters from /home/bertalan/tensorflow/models/faster_rcnn_inception_resnet_v2_atrous_coco_11_06_2017/model.ckpt
2017-07-26 11:27:25.179612: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block4/unit_1/bottleneck_v1/conv3/BatchNorm/beta not found in checkpoint
2017-07-26 11:27:25.179639: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block4/unit_1/bottleneck_v1/conv2/weights not found in checkpoint
2017-07-26 11:27:25.179673: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block4/unit_1/bottleneck_v1/conv3/BatchNorm/moving_mean not found in checkpoint
2017-07-26 11:27:25.179612: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block4/unit_1/bottleneck_v1/conv3/BatchNorm/gamma not found in checkpoint
2017-07-26 11:27:25.185127: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block4/unit_1/bottleneck_v1/conv3/BatchNorm/moving_variance not found in checkpoint
2017-07-26 11:27:25.187191: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block4/unit_1/bottleneck_v1/conv3/weights not found in checkpoint
2017-07-26 11:27:25.187614: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block4/unit_1/bottleneck_v1/shortcut/BatchNorm/gamma not found in checkpoint
2017-07-26 11:27:25.188036: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block4/unit_1/bottleneck_v1/shortcut/BatchNorm/beta not found in checkpoint
2017-07-26 11:27:25.188324: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block4/unit_1/bottleneck_v1/conv2/BatchNorm/moving_variance not found in checkpoint
2017-07-26 11:27:25.189131: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block4/unit_1/bottleneck_v1/shortcut/weights not found in checkpoint
2017-07-26 11:27:25.190319: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block4/unit_1/bottleneck_v1/shortcut/BatchNorm/moving_variance not found in checkpoint
2017-07-26 11:27:25.190613: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block4/unit_1/bottleneck_v1/shortcut/BatchNorm/moving_mean not found in checkpoint
2017-07-26 11:27:25.190923: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block4/unit_1/bottleneck_v1/conv2/BatchNorm/moving_mean not found in checkpoint
2017-07-26 11:27:25.191949: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block4/unit_2/bottleneck_v1/conv1/BatchNorm/moving_mean not found in checkpoint
2017-07-26 11:27:25.192728: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block4/unit_2/bottleneck_v1/conv1/BatchNorm/gamma not found in checkpoint
2017-07-26 11:27:25.193354: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block4/unit_2/bottleneck_v1/conv1/BatchNorm/moving_variance not found in checkpoint
2017-07-26 11:27:25.194102: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key
...
2017-07-26 11:27:25.204869: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block4/unit_3/bottleneck_v1/conv1/BatchNorm/gamma not found in checkpoint
2017-07-26 11:27:25.205198: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block4/unit_3/bottleneck_v1/conv1/BatchNorm/moving_mean not found in checkpoint
2017-07-26 11:27:25.205799: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block4/unit_1/bottleneck_v1/conv1/BatchNorm/moving_mean not found in checkpoint
2017-07-26 11:27:25.205853: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block4/unit_3/bottleneck_v1/conv1/BatchNorm/moving_variance not found in checkpoint
2017-07-26 11:27:25.209234: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block4/unit_3/bottleneck_v1/conv2/BatchNorm/beta not found in checkpoint
2017-07-26 11:27:25.210446: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block4/unit_3/bottleneck_v1/conv1/weights not found in checkpoint
2017-07-26 11:27:25.210829: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block4/unit_1/bottleneck_v1/conv1/BatchNorm/gamma not found in checkpoint
2017-07-26 11:27:25.212274: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block4/unit_3/bottleneck_v1/conv2/BatchNorm/gamma not found in checkpoint
2017-07-26 11:27:25.212305: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block4/unit_3/bottleneck_v1/conv2/BatchNorm/moving_mean not found in checkpoint Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block3/unit_20/bottleneck_v1/conv2/BatchNorm/moving_variance not found in checkpoint
2017-07-26 11:27:25.613441: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block3/unit_20/bottleneck_v1/conv3/BatchNorm/beta not found in checkpoint
2017-07-26 11:27:25.613721: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block3/unit_20/bottleneck_v1/conv3/BatchNorm/gamma not found in checkpoint
2017-07-26 11:27:25.615790: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block3/unit_20/bottleneck_v1/conv3/weights not found in checkpoint
2017-07-26 11:27:25.615937: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block3/unit_20/bottleneck_v1/conv3/BatchNorm/moving_mean not found in checkpoint
2017-07-26 11:27:25.616601: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block3/unit_20/bottleneck_v1/conv3/BatchNorm/moving_variance not found in checkpoint
2017-07-26 11:27:25.616872: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block3/unit_21/bottleneck_v1/conv1/BatchNorm/beta not found in checkpoint
2017-07-26 11:27:25.617185: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block3/unit_22/bottleneck_v1/conv2/BatchNorm/moving_variance not found in checkpoint
2017-07-26 11:27:25.617505: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block3/unit_21/bottleneck_v1/conv1/BatchNorm/gamma not found in checkpoint
2017-07-26 11:27:25.618701: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block3/unit_21/bottleneck_v1/conv1/BatchNorm/moving_mean not found in checkpoint
2017-07-26 11:27:25.618781: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block3/unit_22/bottleneck_v1/conv2/BatchNorm/moving_mean not found in checkpoint
2017-07-26 11:27:25.620022: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block3/unit_21/bottleneck_v1/conv1/weights not found in checkpoint
2017-07-26 11:27:25.621149: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block3/unit_21/bottleneck_v1/conv2/BatchNorm/beta not found in checkpoint
2017-07-26 11:27:25.621225: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block3/unit_22/bottleneck_v1/conv2/BatchNorm/gamma not found in checkpoint
2017-07-26 11:27:25.621225: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block3/unit_21/bottleneck_v1/conv1/BatchNorm/moving_variance not found in checkpoint
2017-07-26 11:27:25.623092: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block3/unit_21/bottleneck_v1/conv2/BatchNorm/gamma not found in checkpoint
2017-07-26 11:27:25.624135: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block3/unit_21/bottleneck_v1/conv2/BatchNorm/moving_variance not found in checkpoint
2017-07-26 11:27:25.627327: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block3/unit_21/bottleneck_v1/conv2/BatchNorm/moving_mean not found in checkpoint
2017-07-26 11:27:25.627572: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block3/unit_21/bottleneck_v1/conv2/weights not found in checkpoint
2017-07-26 11:27:25.628414: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block3/unit_21/bottleneck_v1/conv3/BatchNorm/gamma not found in checkpoint
2017-07-26 11:27:25.628844: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block3/unit_21/bottleneck_v1/conv3/BatchNorm/moving_mean not found in checkpoint
2017-07-26 11:27:25.629118: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block3/unit_22/bottleneck_v1/conv2/BatchNorm/beta not found in checkpoint
2017-07-26 11:27:25.629480: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block3/unit_21/bottleneck_v1/conv3/BatchNorm/beta not found in checkpoint
2017-07-26 11:27:25.629624: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block3/unit_21/bottleneck_v1/conv3/BatchNorm/moving_variance not found in checkpoint
2017-07-26 11:27:25.630848: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block3/unit_22/bottleneck_v1/conv1/weights not found in checkpoint
2017-07-26 11:27:25.631122: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block3/unit_22/bottleneck_v1/conv1/BatchNorm/beta not found in checkpoint
2017-07-26 11:27:25.631167: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block3/unit_21/bottleneck_v1/conv3/weights not found in checkpoint
2017-07-26 11:27:25.632471: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block3/unit_22/bottleneck_v1/conv1/BatchNorm/gamma not found in checkpoint
2017-07-26 11:27:25.633056: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block3/unit_22/bottleneck_v1/conv1/BatchNorm/moving_variance not found in checkpoint
2017-07-26 11:27:25.633295: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block3/unit_22/bottleneck_v1/conv1/BatchNorm/moving_mean not found in checkpoint
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.NotFoundError'>, Key FirstStageFeatureExtractor/resnet_v1_101/block4/unit_1/bottleneck_v1/conv3/BatchNorm/beta not found in checkpoint
[[Node: save/RestoreV2_475 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_recv_save/Const_0, save/RestoreV2_475/tensor_names, save/RestoreV2_475/shape_and_slices)]]
Caused by op u'save/RestoreV2_475', defined at:
File "object_detection/train.py", line 198, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "object_detection/train.py", line 194, in main
worker_job_name, is_chief, FLAGS.train_dir)
File "/home/bertalan/tensorflow/models/object_detection/trainer.py", line 216, in train
from_detection_checkpoint=train_config.from_detection_checkpoint)
File "/home/bertalan/tensorflow/models/object_detection/meta_architectures/faster_rcnn_meta_arch.py", line 1447, in restore_fn
saver = tf.train.Saver(first_stage_variables)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1056, in __init__
self.build()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1086, in build
restore_sequentially=self._restore_sequentially)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 691, in build
restore_sequentially, reshape)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 407, in _AddRestoreOps
tensors = self.restore_op(filename_tensor, saveable, preferred_shard)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 247, in restore_op
[spec.tensor.dtype])[0])
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 669, in restore_v2
dtypes=dtypes, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 768, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2336, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1228, in __init__
self._traceback = _extract_stack()
NotFoundError (see above for traceback): Key FirstStageFeatureExtractor/resnet_v1_101/block4/unit_1/bottleneck_v1/conv3/BatchNorm/beta not found in checkpoint
[[Node: save/RestoreV2_475 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_recv_save/Const_0, save/RestoreV2_475/tensor_names, save/RestoreV2_475/shape_and_slices)]]
Traceback (most recent call last):
File "object_detection/train.py", line 198, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "object_detection/train.py", line 194, in main
worker_job_name, is_chief, FLAGS.train_dir)
File "/home/bertalan/tensorflow/models/object_detection/trainer.py", line 290, in train
saver=saver)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 725, in train
master, start_standard_services=False, config=session_config) as sess:
File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
return self.gen.next()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 960, in managed_session
self.stop(close_summary_writer=close_summary_writer)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 788, in stop
stop_grace_period_secs=self._stop_grace_secs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/coordinator.py", line 389, in join
six.reraise(*self._exc_info_to_raise)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 949, in managed_session
start_standard_services=start_standard_services)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 706, in prepare_or_wait_for_session
init_feed_dict=self._init_feed_dict, init_fn=self._init_fn)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 264, in prepare_session
init_fn(sess)
File "/home/bertalan/tensorflow/models/object_detection/meta_architectures/faster_rcnn_meta_arch.py", line 1450, in restore
saver.restore(sess, checkpoint_path)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1457, in restore
{self.saver_def.filename_tensor_name: save_path})
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 778, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 982, in _run
feed_dict_string, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1032, in _do_run
target_list, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1052, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.NotFoundError: Key FirstStageFeatureExtractor/resnet_v1_101/block4/unit_1/bottleneck_v1/conv3/BatchNorm/beta not found in checkpoint
[[Node: save/RestoreV2_475 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_recv_save/Const_0, save/RestoreV2_475/tensor_names, save/RestoreV2_475/shape_and_slices)]]
Caused by op u'save/RestoreV2_475', defined at:
File "object_detection/train.py", line 198, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "object_detection/train.py", line 194, in main
worker_job_name, is_chief, FLAGS.train_dir)
File "/home/bertalan/tensorflow/models/object_detection/trainer.py", line 216, in train
from_detection_checkpoint=train_config.from_detection_checkpoint)
File "/home/bertalan/tensorflow/models/object_detection/meta_architectures/faster_rcnn_meta_arch.py", line 1447, in restore_fn
saver = tf.train.Saver(first_stage_variables)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1056, in __init__
self.build()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1086, in build
restore_sequentially=self._restore_sequentially)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 691, in build
restore_sequentially, reshape)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 407, in _AddRestoreOps
tensors = self.restore_op(filename_tensor, saveable, preferred_shard)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 247, in restore_op
[spec.tensor.dtype])[0])
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 669, in restore_v2
dtypes=dtypes, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 768, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2336, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1228, in __init__
self._traceback = _extract_stack()
NotFoundError (see above for traceback): Key FirstStageFeatureExtractor/resnet_v1_101/block4/unit_1/bottleneck_v1/conv3/BatchNorm/beta not found in checkpoint
[[Node: save/RestoreV2_475 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_recv_save/Const_0, save/RestoreV2_475/tensor_names, save/RestoreV2_475/shape_and_slices)]]
By the way, here is my config file:
# Faster R-CNN with Resnet-101 (v1) configured for the Oxford-IIIT Pet Dataset.
# Users should configure the fine_tune_checkpoint field in the train config as
# well as the label_map_path and input_path fields in the train_input_reader and
# eval_input_reader. Search for "PATH_TO_BE_CONFIGURED" to find the fields that
# should be configured.
model {
  faster_rcnn {
    num_classes: 37
    image_resizer {
      keep_aspect_ratio_resizer {
        min_dimension: 600
        max_dimension: 1024
      }
    }
    feature_extractor {
      type: 'faster_rcnn_resnet101'
      first_stage_features_stride: 16
    }
    first_stage_anchor_generator {
      grid_anchor_generator {
        scales: [0.25, 0.5, 1.0, 2.0]
        aspect_ratios: [0.5, 1.0, 2.0]
        height_stride: 16
        width_stride: 16
      }
    }
    first_stage_box_predictor_conv_hyperparams {
      op: CONV
      regularizer {
        l2_regularizer {
          weight: 0.0
        }
      }
      initializer {
        truncated_normal_initializer {
          stddev: 0.01
        }
      }
    }
    first_stage_nms_score_threshold: 0.0
    first_stage_nms_iou_threshold: 0.7
    first_stage_max_proposals: 300
    first_stage_localization_loss_weight: 2.0
    first_stage_objectness_loss_weight: 1.0
    initial_crop_size: 14
    maxpool_kernel_size: 2
    maxpool_stride: 2
    second_stage_box_predictor {
      mask_rcnn_box_predictor {
        use_dropout: false
        dropout_keep_probability: 1.0
        fc_hyperparams {
          op: FC
          regularizer {
            l2_regularizer {
              weight: 0.0
            }
          }
          initializer {
            variance_scaling_initializer {
              factor: 1.0
              uniform: true
              mode: FAN_AVG
            }
          }
        }
      }
    }
    second_stage_post_processing {
      batch_non_max_suppression {
        score_threshold: 0.0
        iou_threshold: 0.6
        max_detections_per_class: 100
        max_total_detections: 300
      }
      score_converter: SOFTMAX
    }
    second_stage_localization_loss_weight: 2.0
    second_stage_classification_loss_weight: 1.0
  }
}
train_config: {
  batch_size: 1
  optimizer {
    momentum_optimizer: {
      learning_rate: {
        manual_step_learning_rate {
          initial_learning_rate: 0.0003
          schedule {
            step: 0
            learning_rate: .0003
          }
          schedule {
            step: 900000
            learning_rate: .00003
          }
          schedule {
            step: 1200000
            learning_rate: .000003
          }
        }
      }
      momentum_optimizer_value: 0.9
    }
    use_moving_average: false
  }
  gradient_clipping_by_norm: 10.0
  fine_tune_checkpoint: "/home/bertalan/tensorflow/models/faster_rcnn_inception_resnet_v2_atrous_coco_11_06_2017/model.ckpt"
  from_detection_checkpoint: true
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
}
train_input_reader: {
  tf_record_input_reader {
    input_path: "python /home/bertalan/tensorflow/models/pet_train.record"
  }
  label_map_path: "/home/bertalan/tensorflow/models/object_detection/data/pet_label_map.pbtxt"
}
eval_config: {
  num_examples: 2000
}
eval_input_reader: {
  tf_record_input_reader {
    input_path: "/home/bertalan/tensorflow/models/pet_val.record"
  }
  label_map_path: "/home/bertalan/tensorflow/models/object_detection/data/pet_label_map.pbtxt"
  shuffle: false
  num_readers: 1
}
Your checkpoint and your feature extractor do not match. The config declares feature_extractor type 'faster_rcnn_resnet101', but fine_tune_checkpoint points at the faster_rcnn_inception_resnet_v2_atrous_coco_11_06_2017 checkpoint, so the Saver looks for FirstStageFeatureExtractor/resnet_v1_101/... variables that are not in that file; that is exactly what all the "Key ... not found in checkpoint" warnings are reporting. Either download the ResNet-101 checkpoint that matches this config and point fine_tune_checkpoint at it, or switch to the sample config that matches the checkpoint you downloaded.
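To verify the mismatch yourself, you can list the variable names stored in the checkpoint and compare them with the keys the Saver is requesting. Below is a minimal sketch (not part of the original answer) that assumes TensorFlow 1.x and reuses the checkpoint path from the question:

import tensorflow as tf

# Checkpoint that the config's fine_tune_checkpoint currently points at.
ckpt_path = ("/home/bertalan/tensorflow/models/"
             "faster_rcnn_inception_resnet_v2_atrous_coco_11_06_2017/model.ckpt")

reader = tf.train.NewCheckpointReader(ckpt_path)
for name in sorted(reader.get_variable_to_shape_map()):
    print(name)

# If none of the printed names contain "resnet_v1_101", this checkpoint cannot
# satisfy the FirstStageFeatureExtractor/resnet_v1_101/... keys in the error.

Once the mismatch is confirmed, the fix is either to change fine_tune_checkpoint to a checkpoint actually produced by a ResNet-101 model (for example, the faster_rcnn_resnet101_coco archive from the detection model zoo, assuming you download and extract it), or to use the sample config that matches the inception-resnet-v2 checkpoint already on disk.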
Related
YoloV4 keras to TensorRT
I am custom training YoloV4 based on keras from this repo: https://github.com/taipingeric/yolo-v4-tf.keras

model = Yolov4(weight_path=weights, class_name_path=class_name_path)
model.load_model('saved_model/with es')
#model.predict('input.jpg')
conversion_params = trt.DEFAULT_TRT_CONVERSION_PARAMS._replace(
    precision_mode=trt.TrtPrecisionMode.FP16,
    max_workspace_size_bytes=4000000000,
    max_batch_size=4)
converter = trt.TrtGraphConverterV2(
    input_saved_model_dir=saved_model,
    conversion_params=conversion_params)
converter.convert()
converter.save(output_saved_model_dir)

And then load the same model using NVIDIA's TF20-TF-TRT guide:

saved_model_loaded = tf.saved_model.load(input_saved_model, tags=[tag_constants.SERVING])
signature_keys = list(saved_model_loaded.signatures.keys())
print(signature_keys)
infer = saved_model_loaded.signatures['serving_default']
print(infer.structured_outputs)

But when I try to infer from it using labeling = infer(x), I get the following error:

2022-03-02 10:52:18.993365: W tensorflow/compiler/tf2tensorrt/kernels/trt_engine_op.cc:628] TF-TRT Warning: Engine retrieval for input shapes: [[1,416,416,3]] failed. Running native segment for PartitionedCall/TRTEngineOp_0_1
2022-03-02 10:52:19.035613: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2022-03-02 10:52:19.035702: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at conv_ops.cc:1106 : Not found: No algorithm worked!
2022-03-02 10:52:19.035770: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at trt_engine_op.cc:400 : Not found: No algorithm worked!
  [[{{node StatefulPartitionedCall/model_1/conv2d/Conv2D}}]]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/mnt/d/Testing/research/yolo-v4/yolo-v4/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 1669, in __call__
    return self._call_impl(args, kwargs)
  File "/mnt/d/Testing/research/yolo-v4/yolo-v4/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 1679, in _call_impl
    cancellation_manager)
  File "/mnt/d/Testing/research/yolo-v4/yolo-v4/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 1762, in _call_with_structured_signature
    cancellation_manager=cancellation_manager)
  File "/mnt/d/Testing/research/yolo-v4/yolo-v4/lib/python3.6/site-packages/tensorflow/python/saved_model/load.py", line 116, in _call_flat
    cancellation_manager)
  File "/mnt/d/Testing/research/yolo-v4/yolo-v4/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 1919, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/mnt/d/Testing/research/yolo-v4/yolo-v4/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 560, in call
    ctx=ctx)
  File "/mnt/d/Testing/research/yolo-v4/yolo-v4/lib/python3.6/site-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.NotFoundError: No algorithm worked!
  [[{{node StatefulPartitionedCall/model_1/conv2d/Conv2D}}]]
  [[PartitionedCall/TRTEngineOp_0_1]] [Op:__inference_signature_wrapper_130807]
Function call stack:
signature_wrapper
I keep getting these errors in F-RCNN when I train the model
When I run it for 1000 epochs it also gives similar exceptions in between, and after it finishes the 1000 epochs it keeps repeating these same exceptions. GitHub code for the model: https://github.com/bhatt-priyadutt/blink-rate

Exception: 2 root error(s) found.
  (0) Invalid argument: slice index 2 of dimension 1 out of bounds.
    [[node roi_pooling_conv_1/strided_slice_13 (defined at usr/local/lib/python3.7/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
    [[add_22/_305]]
  (1) Invalid argument: slice index 2 of dimension 1 out of bounds.
    [[node roi_pooling_conv_1/strided_slice_13 (defined at usr/local/lib/python3.7/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
0 successful operations.
0 derived errors ignored.

Original stack trace for 'roi_pooling_conv_1/strided_slice_13':
  File "content/blink-rate/train_frcnn.py", line 134, in <module>
    classifier = nn.classifier(shared_layers, roi_input, C.num_rois, nb_classes=len(classes_count), trainable=True)
  File "content/blink-rate/keras_frcnn/resnet.py", line 239, in classifier
    out_roi_pool = RoiPoolingConv(pooling_regions, num_rois)([base_layers, input_rois])
  File "usr/local/lib/python3.7/dist-packages/keras/engine/topology.py", line 578, in __call__
    output = self.call(inputs, **kwargs)
  File "content/blink-rate/keras_frcnn/RoiPoolingConv.py", line 65, in call
    h = rois[0, roi_idx, 3]
  File "usr/local/lib/python3.7/dist-packages/tensorflow_core/python/ops/array_ops.py", line 802, in _slice_helper
    name=name)
  File "usr/local/lib/python3.7/dist-packages/tensorflow_core/python/ops/array_ops.py", line 968, in strided_slice
    shrink_axis_mask=shrink_axis_mask)
  File "usr/local/lib/python3.7/dist-packages/tensorflow_core/python/ops/gen_array_ops.py", line 10392, in strided_slice
    shrink_axis_mask=shrink_axis_mask, name=name)
  File "usr/local/lib/python3.7/dist-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
    op_def=op_def)
  File "usr/local/lib/python3.7/dist-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "usr/local/lib/python3.7/dist-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "usr/local/lib/python3.7/dist-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
    op_def=op_def)
  File "usr/local/lib/python3.7/dist-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
    self._traceback = tf_stack.extract_stack()

Exception: index 1000 is out of bounds for axis 0 with size 1000
Exception: index 1000 is out of bounds for axis 0 with size 1000
Exception: index 1000 is out of bounds for axis 0 with size 1000
2021-09-26 15:41:01.791109: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at strided_slice_op.cc:108 : Invalid argument: slice index 6 of dimension 1 out of bounds.
2021-09-26 15:41:01.791144: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at strided_slice_op.cc:108 : Invalid argument: slice index 4 of dimension 1 out of bounds.
I am getting the error "Restoring from checkpoint failed." while training with the TensorFlow Estimator API on AI Platform (ml-engine)
I am trying to do hyperparameter tuning on AI Platform for a DNN regressor using the TensorFlow Estimator API. After submitting the job, it shows that the job failed and I get this error in the job details. Can someone help?

Hyperparameter Tuning Trial #1 Failed before any other successful trials were completed. The failed trial had parameters: learning_rate=0.0019937718716419557, num-layers=2, first-layer-size=148, scale-factor=0.7910729020312588. The trial's error message was: The replica master 0 exited with a non-zero status of 1.

Traceback (most recent call last):
  [...]
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 507, in _build_internal
    restore_sequentially, reshape)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 385, in _AddShardedRestoreOps
    name="restore_shard"))
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 332, in _AddRestoreOps
    restore_sequentially)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 580, in bulk_restore
    return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 1572, in restore_v2
    name=name)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
    op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
    self._traceback = tf_stack.extract_stack()
InvalidArgumentError (see above for traceback): Restoring from checkpoint failed. This is most likely due to a mismatch between the current graph and the graph from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:
tensor_name = dnn/hiddenlayer_0/bias; shape in shape_and_slice spec [148] does not match the shape stored in checkpoint: [117]
  [[node save/RestoreV2_1 (defined at /usr/local/lib/python3.5/dist-packages/tensorflow_estimator/python/estimator/estimator.py:1403) ]]
It looks like you are using the same output directory for all the trials, so trial #1 is trying to read trial #2's checkpoint (perhaps because it is the latest one in the directory) and failing because the architecture is different. Make sure to use a different output directory for each hyperparameter training run. There are two ways you can do this:

Use --job-dir as the output directory.

Append the hyperparameter trial number to the output directory you are using now:

output_dir = os.path.join(
    output_dir,
    json.loads(
        os.environ.get('TF_CONFIG', '{}')
    ).get('task', {}).get('trial', '')
)
Python list error for count vectorizer and fit function
Please tell me what is wrong and how to rectify it.

data = open(r"C:\Users\HS\Desktop\WORK\R\R DATA\g textonly2.txt").read()
labels, texts = [], []
#print(data)
for i, line in enumerate(data.split("\n")):
    content = line.split()
    #print(content)
    if len(content) is not 0:
        labels.append(content[0])
        texts.append(content[1:])

# create a dataframe using texts and labels
trainDF = pandas.DataFrame()
trainDF['text'] = texts
trainDF['label'] = labels

# split the dataset into training and validation datasets
train_x, valid_x, train_y, valid_y = model_selection.train_test_split(trainDF['text'], trainDF['label'])

# label encode the target variable
encoder = preprocessing.LabelEncoder()
train_y = encoder.fit_transform(train_y)
valid_y = encoder.fit_transform(valid_y)

# create a count vectorizer object
count_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')
count_vect.fit(trainDF['text'])

The data file contains data like this:

0 #\xdaltimahora Es tracta d'un aparell de Germanwings amb 152 passatgers a bord
0 Route map now being shared by http:
0 Pray for #4U9525 http:
0 Airbus A320 #4U9525 crash: \nFlight tracking data here: \nhttp

Error:

"C:\Program Files\Python36\python.exe" "C:/Users/HS/PycharmProjects/R/C/Text classification1.py"
Using TensorFlow backend.
Traceback (most recent call last):
  File "C:/Users/HS/PycharmProjects/R/C/Text classification1.py", line 38, in <module>
    count_vect.fit(trainDF['text'])
  File "C:\Program Files\Python36\lib\site-packages\sklearn\feature_extraction\text.py", line 836, in fit
    self.fit_transform(raw_documents)
  File "C:\Program Files\Python36\lib\site-packages\sklearn\feature_extraction\text.py", line 869, in fit_transform
    self.fixed_vocabulary_)
  File "C:\Program Files\Python36\lib\site-packages\sklearn\feature_extraction\text.py", line 792, in _count_vocab
    for feature in analyze(doc):
  File "C:\Program Files\Python36\lib\site-packages\sklearn\feature_extraction\text.py", line 266, in <lambda>
    tokenize(preprocess(self.decode(doc))), stop_words)
  File "C:\Program Files\Python36\lib\site-packages\sklearn\feature_extraction\text.py", line 232, in <lambda>
    return lambda x: strip_accents(x.lower())
AttributeError: 'list' object has no attribute 'lower'

Process finished with exit code 1
From the documentation:

fit(raw_documents, y=None)
    Learn a vocabulary dictionary of all tokens in the raw documents.
    Parameters: raw_documents : iterable
        An iterable which yields either str, unicode or file objects.
    Returns: self

You get the error AttributeError: 'list' object has no attribute 'lower' because you gave it an iterable (in this case a pd.Series) of list objects, instead of an iterable of strings. You should be able to fix this by using texts.append(' '.join(content[1:])) instead of texts.append(content[1:]):

for i, line in enumerate(data.split("\n")):
    content = line.split()
    #print(content)
    if len(content) is not 0:
        labels.append(content[0])
        #texts.append(content[1:])
        texts.append(' '.join(content[1:]))
Spark java.lang.StackOverflowError on logistic regression fit with large dataset
I am trying to fit a logistic regression model for a data set with 470 features and 10 million training instances. Here is a snippet of my code.

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import RFormula

formula = RFormula(formula = "label ~ .-classWeight")
bestregLambdaVal = 0.005
bestregAlphaVal = 0.01
lr = LogisticRegression(maxIter=1000, regParam=bestregLambdaVal, elasticNetParam=bestregAlphaVal, weightCol="classWeight")
pipeLineLr = Pipeline(stages = [formula, lr])
pipeLineFit = pipeLineLr.fit(mySparkDataFrame[featureColumnNameList + ['classWeight','label']])

I have also created a checkpoint directory, sc.setCheckpointDir('checkpoint/'), as suggested here: Spark gives a StackOverflowError when training using ALS

However I get an error, and here is a partial trace:

File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/base.py", line 64, in fit
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/pipeline.py", line 108, in _fit
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/base.py", line 64, in fit
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py", line 265, in _fit
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py", line 262, in _fit_java
File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o383361.fit.
: java.lang.StackOverflowError
  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
  at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
  at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
  at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
  at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
  at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
  at scala.collection.immutable.List$SerializationProxy.writeObject(List.scala:468)
  at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source)

I would also like to note that the 470 feature columns were iteratively added to the Spark DataFrame using withColumn().
So the mistake I was making is that, when checkpointing the dataframe, I would only do:

mySparkDataFrame.checkpoint(eager=True)

The right way to do it is:

mySparkDataFrame = mySparkDataFrame.checkpoint(eager=True)

This is based on another question I had asked (and got an answer for) here: pyspark rdd isCheckPointed() is false

Also, it is recommended to persist() the dataframe before the checkpoint and to count() it after the checkpoint.
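Putting those recommendations together, a minimal sketch might look like the following; it reuses the names from the question (sc, mySparkDataFrame, pipeLineLr, featureColumnNameList), so it is only illustrative rather than a drop-in script:

# checkpoint() returns a new DataFrame, so the result has to be reassigned.
sc.setCheckpointDir('checkpoint/')                # same directory as in the question
mySparkDataFrame = mySparkDataFrame.persist()     # cache before checkpointing so the long lineage is not recomputed
mySparkDataFrame = mySparkDataFrame.checkpoint(eager=True)
mySparkDataFrame.count()                          # force materialization so fit() sees the truncated lineage
pipeLineFit = pipeLineLr.fit(mySparkDataFrame[featureColumnNameList + ['classWeight', 'label']])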