TensorFlow object detection error when training

Hey guys!
I am trying to run the cat example locally and I got stuck at the training step. I get this very long error. Could someone help me understand what is wrong?
Thanks in advance.
The command:
bertalan#mbqs:~/tensorflow/models$ python object_detection/train.py --logtostderr --pipeline_config_path=/home/bertalan/tensorflow/models/object_detection/samples/configs/Myfaster_rcnn_resnet101_pets.config --train_dir=TrainCat
Here is the output with the error:
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Scale of 0 disables regularizer.
INFO:tensorflow:Summary name Learning Rate is illegal; using Learning_Rate instead.
INFO:tensorflow:Summary name /clone_loss is illegal; using clone_loss instead.
/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gradients_impl.py:93: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
"Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
2017-07-26 11:27:18.416172: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-26 11:27:18.416220: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-26 11:27:18.416248: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-07-26 11:27:20.437921: I tensorflow/core/common_runtime/simple_placer.cc:669] Ignoring device specification /device:GPU:0 for node 'prefetch_queue_Dequeue' because the input edge from 'prefetch_queue' is a reference connection and already has a device field set to /device:CPU:0
INFO:tensorflow:Restoring parameters from /home/bertalan/tensorflow/models/faster_rcnn_inception_resnet_v2_atrous_coco_11_06_2017/model.ckpt
2017-07-26 11:27:25.179612: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block4/unit_1/bottleneck_v1/conv3/BatchNorm/beta not found in checkpoint
2017-07-26 11:27:25.179639: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block4/unit_1/bottleneck_v1/conv2/weights not found in checkpoint
2017-07-26 11:27:25.179673: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block4/unit_1/bottleneck_v1/conv3/BatchNorm/moving_mean not found in checkpoint
2017-07-26 11:27:25.179612: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block4/unit_1/bottleneck_v1/conv3/BatchNorm/gamma not found in checkpoint
2017-07-26 11:27:25.185127: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block4/unit_1/bottleneck_v1/conv3/BatchNorm/moving_variance not found in checkpoint
2017-07-26 11:27:25.187191: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block4/unit_1/bottleneck_v1/conv3/weights not found in checkpoint
2017-07-26 11:27:25.187614: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block4/unit_1/bottleneck_v1/shortcut/BatchNorm/gamma not found in checkpoint
2017-07-26 11:27:25.188036: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block4/unit_1/bottleneck_v1/shortcut/BatchNorm/beta not found in checkpoint
2017-07-26 11:27:25.188324: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block4/unit_1/bottleneck_v1/conv2/BatchNorm/moving_variance not found in checkpoint
2017-07-26 11:27:25.189131: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block4/unit_1/bottleneck_v1/shortcut/weights not found in checkpoint
2017-07-26 11:27:25.190319: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block4/unit_1/bottleneck_v1/shortcut/BatchNorm/moving_variance not found in checkpoint
2017-07-26 11:27:25.190613: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block4/unit_1/bottleneck_v1/shortcut/BatchNorm/moving_mean not found in checkpoint
2017-07-26 11:27:25.190923: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block4/unit_1/bottleneck_v1/conv2/BatchNorm/moving_mean not found in checkpoint
2017-07-26 11:27:25.191949: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block4/unit_2/bottleneck_v1/conv1/BatchNorm/moving_mean not found in checkpoint
2017-07-26 11:27:25.192728: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block4/unit_2/bottleneck_v1/conv1/BatchNorm/gamma not found in checkpoint
2017-07-26 11:27:25.193354: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block4/unit_2/bottleneck_v1/conv1/BatchNorm/moving_variance not found in checkpoint
2017-07-26 11:27:25.194102: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key
...
2017-07-26 11:27:25.204869: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block4/unit_3/bottleneck_v1/conv1/BatchNorm/gamma not found in checkpoint
2017-07-26 11:27:25.205198: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block4/unit_3/bottleneck_v1/conv1/BatchNorm/moving_mean not found in checkpoint
2017-07-26 11:27:25.205799: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block4/unit_1/bottleneck_v1/conv1/BatchNorm/moving_mean not found in checkpoint
2017-07-26 11:27:25.205853: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block4/unit_3/bottleneck_v1/conv1/BatchNorm/moving_variance not found in checkpoint
2017-07-26 11:27:25.209234: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block4/unit_3/bottleneck_v1/conv2/BatchNorm/beta not found in checkpoint
2017-07-26 11:27:25.210446: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block4/unit_3/bottleneck_v1/conv1/weights not found in checkpoint
2017-07-26 11:27:25.210829: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block4/unit_1/bottleneck_v1/conv1/BatchNorm/gamma not found in checkpoint
2017-07-26 11:27:25.212274: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block4/unit_3/bottleneck_v1/conv2/BatchNorm/gamma not found in checkpoint
2017-07-26 11:27:25.212305: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block4/unit_3/bottleneck_v1/conv2/BatchNorm/moving_mean not found in checkpoint
Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block3/unit_20/bottleneck_v1/conv2/BatchNorm/moving_variance not found in checkpoint
2017-07-26 11:27:25.613441: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block3/unit_20/bottleneck_v1/conv3/BatchNorm/beta not found in checkpoint
2017-07-26 11:27:25.613721: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block3/unit_20/bottleneck_v1/conv3/BatchNorm/gamma not found in checkpoint
2017-07-26 11:27:25.615790: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block3/unit_20/bottleneck_v1/conv3/weights not found in checkpoint
2017-07-26 11:27:25.615937: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block3/unit_20/bottleneck_v1/conv3/BatchNorm/moving_mean not found in checkpoint
2017-07-26 11:27:25.616601: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block3/unit_20/bottleneck_v1/conv3/BatchNorm/moving_variance not found in checkpoint
2017-07-26 11:27:25.616872: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block3/unit_21/bottleneck_v1/conv1/BatchNorm/beta not found in checkpoint
2017-07-26 11:27:25.617185: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block3/unit_22/bottleneck_v1/conv2/BatchNorm/moving_variance not found in checkpoint
2017-07-26 11:27:25.617505: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block3/unit_21/bottleneck_v1/conv1/BatchNorm/gamma not found in checkpoint
2017-07-26 11:27:25.618701: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block3/unit_21/bottleneck_v1/conv1/BatchNorm/moving_mean not found in checkpoint
2017-07-26 11:27:25.618781: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block3/unit_22/bottleneck_v1/conv2/BatchNorm/moving_mean not found in checkpoint
2017-07-26 11:27:25.620022: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block3/unit_21/bottleneck_v1/conv1/weights not found in checkpoint
2017-07-26 11:27:25.621149: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block3/unit_21/bottleneck_v1/conv2/BatchNorm/beta not found in checkpoint
2017-07-26 11:27:25.621225: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block3/unit_22/bottleneck_v1/conv2/BatchNorm/gamma not found in checkpoint
2017-07-26 11:27:25.621225: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block3/unit_21/bottleneck_v1/conv1/BatchNorm/moving_variance not found in checkpoint
2017-07-26 11:27:25.623092: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block3/unit_21/bottleneck_v1/conv2/BatchNorm/gamma not found in checkpoint
2017-07-26 11:27:25.624135: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block3/unit_21/bottleneck_v1/conv2/BatchNorm/moving_variance not found in checkpoint
2017-07-26 11:27:25.627327: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block3/unit_21/bottleneck_v1/conv2/BatchNorm/moving_mean not found in checkpoint
2017-07-26 11:27:25.627572: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block3/unit_21/bottleneck_v1/conv2/weights not found in checkpoint
2017-07-26 11:27:25.628414: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block3/unit_21/bottleneck_v1/conv3/BatchNorm/gamma not found in checkpoint
2017-07-26 11:27:25.628844: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block3/unit_21/bottleneck_v1/conv3/BatchNorm/moving_mean not found in checkpoint
2017-07-26 11:27:25.629118: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block3/unit_22/bottleneck_v1/conv2/BatchNorm/beta not found in checkpoint
2017-07-26 11:27:25.629480: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block3/unit_21/bottleneck_v1/conv3/BatchNorm/beta not found in checkpoint
2017-07-26 11:27:25.629624: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block3/unit_21/bottleneck_v1/conv3/BatchNorm/moving_variance not found in checkpoint
2017-07-26 11:27:25.630848: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block3/unit_22/bottleneck_v1/conv1/weights not found in checkpoint
2017-07-26 11:27:25.631122: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block3/unit_22/bottleneck_v1/conv1/BatchNorm/beta not found in checkpoint
2017-07-26 11:27:25.631167: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block3/unit_21/bottleneck_v1/conv3/weights not found in checkpoint
2017-07-26 11:27:25.632471: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block3/unit_22/bottleneck_v1/conv1/BatchNorm/gamma not found in checkpoint
2017-07-26 11:27:25.633056: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block3/unit_22/bottleneck_v1/conv1/BatchNorm/moving_variance not found in checkpoint
2017-07-26 11:27:25.633295: W tensorflow/core/framework/op_kernel.cc:1152] Not found: Key FirstStageFeatureExtractor/resnet_v1_101/block3/unit_22/bottleneck_v1/conv1/BatchNorm/moving_mean not found in checkpoint
INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.NotFoundError'>, Key FirstStageFeatureExtractor/resnet_v1_101/block4/unit_1/bottleneck_v1/conv3/BatchNorm/beta not found in checkpoint
[[Node: save/RestoreV2_475 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_recv_save/Const_0, save/RestoreV2_475/tensor_names, save/RestoreV2_475/shape_and_slices)]]
Caused by op u'save/RestoreV2_475', defined at:
File "object_detection/train.py", line 198, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "object_detection/train.py", line 194, in main
worker_job_name, is_chief, FLAGS.train_dir)
File "/home/bertalan/tensorflow/models/object_detection/trainer.py", line 216, in train
from_detection_checkpoint=train_config.from_detection_checkpoint)
File "/home/bertalan/tensorflow/models/object_detection/meta_architectures/faster_rcnn_meta_arch.py", line 1447, in restore_fn
saver = tf.train.Saver(first_stage_variables)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1056, in __init__
self.build()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1086, in build
restore_sequentially=self._restore_sequentially)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 691, in build
restore_sequentially, reshape)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 407, in _AddRestoreOps
tensors = self.restore_op(filename_tensor, saveable, preferred_shard)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 247, in restore_op
[spec.tensor.dtype])[0])
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 669, in restore_v2
dtypes=dtypes, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 768, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2336, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1228, in __init__
self._traceback = _extract_stack()
NotFoundError (see above for traceback): Key FirstStageFeatureExtractor/resnet_v1_101/block4/unit_1/bottleneck_v1/conv3/BatchNorm/beta not found in checkpoint
[[Node: save/RestoreV2_475 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_recv_save/Const_0, save/RestoreV2_475/tensor_names, save/RestoreV2_475/shape_and_slices)]]
Traceback (most recent call last):
File "object_detection/train.py", line 198, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "object_detection/train.py", line 194, in main
worker_job_name, is_chief, FLAGS.train_dir)
File "/home/bertalan/tensorflow/models/object_detection/trainer.py", line 290, in train
saver=saver)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/slim/python/slim/learning.py", line 725, in train
master, start_standard_services=False, config=session_config) as sess:
File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
return self.gen.next()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 960, in managed_session
self.stop(close_summary_writer=close_summary_writer)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 788, in stop
stop_grace_period_secs=self._stop_grace_secs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/coordinator.py", line 389, in join
six.reraise(*self._exc_info_to_raise)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 949, in managed_session
start_standard_services=start_standard_services)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/supervisor.py", line 706, in prepare_or_wait_for_session
init_feed_dict=self._init_feed_dict, init_fn=self._init_fn)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/session_manager.py", line 264, in prepare_session
init_fn(sess)
File "/home/bertalan/tensorflow/models/object_detection/meta_architectures/faster_rcnn_meta_arch.py", line 1450, in restore
saver.restore(sess, checkpoint_path)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1457, in restore
{self.saver_def.filename_tensor_name: save_path})
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 778, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 982, in _run
feed_dict_string, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1032, in _do_run
target_list, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1052, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.NotFoundError: Key FirstStageFeatureExtractor/resnet_v1_101/block4/unit_1/bottleneck_v1/conv3/BatchNorm/beta not found in checkpoint
[[Node: save/RestoreV2_475 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_recv_save/Const_0, save/RestoreV2_475/tensor_names, save/RestoreV2_475/shape_and_slices)]]
Caused by op u'save/RestoreV2_475', defined at:
File "object_detection/train.py", line 198, in <module>
tf.app.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "object_detection/train.py", line 194, in main
worker_job_name, is_chief, FLAGS.train_dir)
File "/home/bertalan/tensorflow/models/object_detection/trainer.py", line 216, in train
from_detection_checkpoint=train_config.from_detection_checkpoint)
File "/home/bertalan/tensorflow/models/object_detection/meta_architectures/faster_rcnn_meta_arch.py", line 1447, in restore_fn
saver = tf.train.Saver(first_stage_variables)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1056, in __init__
self.build()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 1086, in build
restore_sequentially=self._restore_sequentially)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 691, in build
restore_sequentially, reshape)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 407, in _AddRestoreOps
tensors = self.restore_op(filename_tensor, saveable, preferred_shard)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/saver.py", line 247, in restore_op
[spec.tensor.dtype])[0])
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 669, in restore_v2
dtypes=dtypes, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 768, in apply_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2336, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1228, in __init__
self._traceback = _extract_stack()
NotFoundError (see above for traceback): Key FirstStageFeatureExtractor/resnet_v1_101/block4/unit_1/bottleneck_v1/conv3/BatchNorm/beta not found in checkpoint
[[Node: save/RestoreV2_475 = RestoreV2[dtypes=[DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"](_recv_save/Const_0, save/RestoreV2_475/tensor_names, save/RestoreV2_475/shape_and_slices)]]
By the way, here is my config file:
# Faster R-CNN with Resnet-101 (v1) configured for the Oxford-IIIT Pet Dataset.
# Users should configure the fine_tune_checkpoint field in the train config as
# well as the label_map_path and input_path fields in the train_input_reader and
# eval_input_reader. Search for "PATH_TO_BE_CONFIGURED" to find the fields that
# should be configured.
model {
  faster_rcnn {
    num_classes: 37
    image_resizer {
      keep_aspect_ratio_resizer {
        min_dimension: 600
        max_dimension: 1024
      }
    }
    feature_extractor {
      type: 'faster_rcnn_resnet101'
      first_stage_features_stride: 16
    }
    first_stage_anchor_generator {
      grid_anchor_generator {
        scales: [0.25, 0.5, 1.0, 2.0]
        aspect_ratios: [0.5, 1.0, 2.0]
        height_stride: 16
        width_stride: 16
      }
    }
    first_stage_box_predictor_conv_hyperparams {
      op: CONV
      regularizer {
        l2_regularizer {
          weight: 0.0
        }
      }
      initializer {
        truncated_normal_initializer {
          stddev: 0.01
        }
      }
    }
    first_stage_nms_score_threshold: 0.0
    first_stage_nms_iou_threshold: 0.7
    first_stage_max_proposals: 300
    first_stage_localization_loss_weight: 2.0
    first_stage_objectness_loss_weight: 1.0
    initial_crop_size: 14
    maxpool_kernel_size: 2
    maxpool_stride: 2
    second_stage_box_predictor {
      mask_rcnn_box_predictor {
        use_dropout: false
        dropout_keep_probability: 1.0
        fc_hyperparams {
          op: FC
          regularizer {
            l2_regularizer {
              weight: 0.0
            }
          }
          initializer {
            variance_scaling_initializer {
              factor: 1.0
              uniform: true
              mode: FAN_AVG
            }
          }
        }
      }
    }
    second_stage_post_processing {
      batch_non_max_suppression {
        score_threshold: 0.0
        iou_threshold: 0.6
        max_detections_per_class: 100
        max_total_detections: 300
      }
      score_converter: SOFTMAX
    }
    second_stage_localization_loss_weight: 2.0
    second_stage_classification_loss_weight: 1.0
  }
}
train_config: {
  batch_size: 1
  optimizer {
    momentum_optimizer: {
      learning_rate: {
        manual_step_learning_rate {
          initial_learning_rate: 0.0003
          schedule {
            step: 0
            learning_rate: .0003
          }
          schedule {
            step: 900000
            learning_rate: .00003
          }
          schedule {
            step: 1200000
            learning_rate: .000003
          }
        }
      }
      momentum_optimizer_value: 0.9
    }
    use_moving_average: false
  }
  gradient_clipping_by_norm: 10.0
  fine_tune_checkpoint: "/home/bertalan/tensorflow/models/faster_rcnn_inception_resnet_v2_atrous_coco_11_06_2017/model.ckpt"
  from_detection_checkpoint: true
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
}
train_input_reader: {
  tf_record_input_reader {
    input_path: "python /home/bertalan/tensorflow/models/pet_train.record"
  }
  label_map_path: "/home/bertalan/tensorflow/models/object_detection/data/pet_label_map.pbtxt"
}
eval_config: {
  num_examples: 2000
}
eval_input_reader: {
  tf_record_input_reader {
    input_path: "/home/bertalan/tensorflow/models/pet_val.record"
  }
  label_map_path: "/home/bertalan/tensorflow/models/object_detection/data/pet_label_map.pbtxt"
}

Your checkpoint model and your feature extractor are different. The config declares feature_extractor type 'faster_rcnn_resnet101', but fine_tune_checkpoint points at the faster_rcnn_inception_resnet_v2_atrous_coco_11_06_2017 checkpoint, so the variables the Saver asks for (FirstStageFeatureExtractor/resnet_v1_101/...) simply do not exist in that file. Either point fine_tune_checkpoint at a Resnet-101 detection checkpoint (for example faster_rcnn_resnet101_coco from the model zoo) or change the feature extractor type to match the checkpoint you downloaded.
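To confirm the mismatch, you can list the variable names stored in the checkpoint and compare them with the keys the Saver is looking for. A minimal TF 1.x sketch (the path is the one from the question):

# List every variable name stored in the checkpoint file.
import tensorflow as tf

ckpt_path = "/home/bertalan/tensorflow/models/faster_rcnn_inception_resnet_v2_atrous_coco_11_06_2017/model.ckpt"
reader = tf.train.NewCheckpointReader(ckpt_path)
for name in sorted(reader.get_variable_to_shape_map()):
    print(name)
# An inception_resnet_v2 detection checkpoint should print feature-extractor names
# based on InceptionResnetV2 rather than resnet_v1_101, which is why keys such as
# FirstStageFeatureExtractor/resnet_v1_101/.../BatchNorm/beta cannot be found.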

Related

YoloV4 keras to TensorRT

I am custom-training YoloV4 based on Keras, from this repo: https://github.com/taipingeric/yolo-v4-tf.keras
model = Yolov4(weight_path=weights,
               class_name_path=class_name_path)
model.load_model('saved_model/with es')
#model.predict('input.jpg')
conversion_params = trt.DEFAULT_TRT_CONVERSION_PARAMS._replace(
    precision_mode=trt.TrtPrecisionMode.FP16,
    max_workspace_size_bytes=4000000000,
    max_batch_size=4)
converter = trt.TrtGraphConverterV2(
    input_saved_model_dir=saved_model, conversion_params=conversion_params)
converter.convert()
converter.save(output_saved_model_dir)
And then I load the same model following NVIDIA's TF 2.0 TF-TRT guide:
saved_model_loaded = tf.saved_model.load(input_saved_model, tags=[tag_constants.SERVING])
signature_keys = list(saved_model_loaded.signatures.keys())
print(signature_keys)
infer = saved_model_loaded.signatures['serving_default']
print(infer.structured_outputs)
But when I try to infer from it using
labeling = infer(x)
I get the following error:
2022-03-02 10:52:18.993365: W tensorflow/compiler/tf2tensorrt/kernels/trt_engine_op.cc:628] TF-TRT Warning: Engine retrieval for input shapes: [[1,416,416,3]] failed. Running native segment for PartitionedCall/TRTEngineOp_0_1
2022-03-02 10:52:19.035613: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2022-03-02 10:52:19.035702: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at conv_ops.cc:1106 : Not found: No algorithm worked!
2022-03-02 10:52:19.035770: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at trt_engine_op.cc:400 : Not found: No algorithm worked!
[[{{node StatefulPartitionedCall/model_1/conv2d/Conv2D}}]]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/mnt/d/Testing/research/yolo-v4/yolo-v4/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 1669, in __call__
return self._call_impl(args, kwargs)
File "/mnt/d/Testing/research/yolo-v4/yolo-v4/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 1679, in _call_impl
cancellation_manager)
File "/mnt/d/Testing/research/yolo-v4/yolo-v4/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 1762, in _call_with_structured_signature
cancellation_manager=cancellation_manager)
File "/mnt/d/Testing/research/yolo-v4/yolo-v4/lib/python3.6/site-packages/tensorflow/python/saved_model/load.py", line 116, in _call_flat
cancellation_manager)
File "/mnt/d/Testing/research/yolo-v4/yolo-v4/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 1919, in _call_flat
ctx, args, cancellation_manager=cancellation_manager))
File "/mnt/d/Testing/research/yolo-v4/yolo-v4/lib/python3.6/site-packages/tensorflow/python/eager/function.py", line 560, in call
ctx=ctx)
File "/mnt/d/Testing/research/yolo-v4/yolo-v4/lib/python3.6/site-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.NotFoundError: No algorithm worked!
[[{{node StatefulPartitionedCall/model_1/conv2d/Conv2D}}]]
[[PartitionedCall/TRTEngineOp_0_1]] [Op:__inference_signature_wrapper_130807]
Function call stack:
signature_wrapper
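The CUBLAS_STATUS_ALLOC_FAILED / "No algorithm worked!" pair usually points to the GPU running out of memory when the native (non-TRT) segment tries to initialize cuBLAS/cuDNN. A minimal sketch, assuming TF 2.x, that enables GPU memory growth before loading the converted model; input_saved_model is the directory written by converter.save() and the input shape follows the log:

import numpy as np
import tensorflow as tf
from tensorflow.python.saved_model import tag_constants

# Let TensorFlow grow GPU memory on demand instead of grabbing it all up front,
# so cuBLAS/cuDNN can still allocate workspace at inference time.
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)

input_saved_model = 'output_saved_model_dir'  # directory written by converter.save()
saved_model_loaded = tf.saved_model.load(input_saved_model, tags=[tag_constants.SERVING])
infer = saved_model_loaded.signatures['serving_default']

x = tf.constant(np.zeros((1, 416, 416, 3), dtype=np.float32))  # dummy batch, shape from the log
labeling = infer(x)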

I keep getting these errors in F-RCNN when I train the model

When I run it for 1000 epochs, it intermittently raises similar exceptions along the way, and after the 1000 epochs finish it keeps repeating the same exceptions. GitHub code for the model: https://github.com/bhatt-priyadutt/blink-rate
Exception: 2 root error(s) found.
(0) Invalid argument: slice index 2 of dimension 1 out of bounds.
[[node roi_pooling_conv_1/strided_slice_13 (defined at usr/local/lib/python3.7/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
[[add_22/_305]]
(1) Invalid argument: slice index 2 of dimension 1 out of bounds.
[[node roi_pooling_conv_1/strided_slice_13 (defined at usr/local/lib/python3.7/dist-packages/tensorflow_core/python/framework/ops.py:1748) ]]
0 successful operations.
0 derived errors ignored.
Original stack trace for 'roi_pooling_conv_1/strided_slice_13':
File "content/blink-rate/train_frcnn.py", line 134, in <module>
classifier = nn.classifier(shared_layers, roi_input, C.num_rois, nb_classes=len(classes_count), trainable=True)
File "content/blink-rate/keras_frcnn/resnet.py", line 239, in classifier
out_roi_pool = RoiPoolingConv(pooling_regions, num_rois)([base_layers, input_rois])
File "usr/local/lib/python3.7/dist-packages/keras/engine/topology.py", line 578, in __call__
output = self.call(inputs, **kwargs)
File "content/blink-rate/keras_frcnn/RoiPoolingConv.py", line 65, in call
h = rois[0, roi_idx, 3]
File "usr/local/lib/python3.7/dist-packages/tensorflow_core/python/ops/array_ops.py", line 802, in _slice_helper
name=name)
File "usr/local/lib/python3.7/dist-packages/tensorflow_core/python/ops/array_ops.py", line 968, in strided_slice
shrink_axis_mask=shrink_axis_mask)
File "usr/local/lib/python3.7/dist-packages/tensorflow_core/python/ops/gen_array_ops.py", line 10392, in strided_slice
shrink_axis_mask=shrink_axis_mask, name=name)
File "usr/local/lib/python3.7/dist-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
op_def=op_def)
File "usr/local/lib/python3.7/dist-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "usr/local/lib/python3.7/dist-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
attrs, op_def, compute_device)
File "usr/local/lib/python3.7/dist-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
op_def=op_def)
File "usr/local/lib/python3.7/dist-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
self._traceback = tf_stack.extract_stack()
Exception: index 1000 is out of bounds for axis 0 with size 1000
Exception: index 1000 is out of bounds for axis 0 with size 1000
Exception: index 1000 is out of bounds for axis 0 with size 1000
2021-09-26 15:41:01.791109: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at strided_slice_op.cc:108 : Invalid argument: slice index 6 of dimension 1 out of bounds.
2021-09-26 15:41:01.791144: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at strided_slice_op.cc:108 : Invalid argument: slice index 4 of dimension 1 out of bounds.
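The "slice index N of dimension 1 out of bounds" errors usually mean the ROI array fed to the classifier has fewer rows than C.num_rois, while RoiPoolingConv always slices roi_idx from 0 to num_rois - 1. A minimal NumPy padding sketch; pad_rois is a hypothetical helper, and X2 stands for the (1, n_rois, 4) ROI batch used by the repo's training loop:

import numpy as np

def pad_rois(X2, num_rois):
    # Make sure the ROI batch always has exactly num_rois rows by resampling
    # existing ROIs (with replacement) when there are too few of them.
    n = X2.shape[1]
    if n >= num_rois:
        return X2[:, :num_rois, :]
    extra_idx = np.random.choice(n, num_rois - n)
    return np.concatenate([X2, X2[:, extra_idx, :]], axis=1)

X2 = np.array([[[0, 0, 10, 10], [5, 5, 20, 20]]], dtype=np.float32)  # (1, 2, 4)
X2 = pad_rois(X2, 4)   # would be pad_rois(X2, C.num_rois) in the training loop
print(X2.shape)        # (1, 4, 4)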

I am getting error "Restoring from checkpoint failed." while training tensorflow estimator api on AI-platform(ml-engine)

I am trying to do hyperparameter tuning on AI Platform (ml-engine) for a DNN regressor using the TensorFlow Estimator API. But after submitting the job, it shows that the job failed, and I get this error in the job details.
Can someone help?
Hyperparameter Tuning Trial #1 Failed before any other successful trials were completed. The failed trial had parameters: learning_rate=0.0019937718716419557, num-layers=2, first-layer-size=148, scale-factor=0.7910729020312588, . The trial's error message was: The replica master 0 exited with a non-zero status of 1.
Traceback (most recent call last):
[...]
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 507, in _build_internal
restore_sequentially, reshape)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 385, in _AddShardedRestoreOps
name="restore_shard"))
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 332, in _AddRestoreOps
restore_sequentially)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/saver.py", line 580, in bulk_restore
return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 1572, in restore_v2
name=name)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
op_def=op_def)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
self._traceback = tf_stack.extract_stack()
InvalidArgumentError (see above for traceback): Restoring from checkpoint failed. This is most likely due to a mismatch between the current graph and the graph from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:
tensor_name = dnn/hiddenlayer_0/bias; shape in shape_and_slice spec [148] does not match the shape stored in checkpoint: [117]
[[node save/RestoreV2_1 (defined at /usr/local/lib/python3.5/dist-packages/tensorflow_estimator/python/estimator/estimator.py:1403) ]]
It looks like you are using the same output directory for all the trials, so trial #1 is trying to read trial #2's checkpoint (perhaps because it is the latest one in the directory) and failing because the architecture is different.
Make sure to use a different output directory for each hyperparameter training run. There are two ways to do this:
Use the --job-dir as the output directory.
Append the hyperparameter trial number to the output directory you are using now:
output_dir = os.path.join( output_dir, json.loads( os.environ.get('TF_CONFIG', '{}') ).get('task', {}).get('trial', '') )
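Spelled out with its imports, that second option looks something like this sketch (output_dir here is a hypothetical base directory; on AI Platform the trial number comes from the TF_CONFIG environment variable):

import json
import os

output_dir = 'gs://my-bucket/dnn-regressor'   # hypothetical base output directory

# Append the trial number from TF_CONFIG so every hyperparameter trial writes its
# checkpoints to its own subdirectory instead of restoring another trial's graph.
trial = json.loads(os.environ.get('TF_CONFIG', '{}')).get('task', {}).get('trial', '')
output_dir = os.path.join(output_dir, trial)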

Python list error for count vectorizer and fit function

Please tell me what is wrong and how to rectify it.
data = open(r"C:\Users\HS\Desktop\WORK\R\R DATA\g textonly2.txt").read()
labels, texts = [], []
#print(data)
for i, line in enumerate(data.split("\n")):
    content = line.split()
    #print(content)
    if len(content) is not 0:
        labels.append(content[0])
        texts.append(content[1:])
# create a dataframe using texts and labels
trainDF = pandas.DataFrame()
trainDF['text'] = texts
trainDF['label'] = labels
# split the dataset into training and validation datasets
train_x, valid_x, train_y, valid_y = model_selection.train_test_split(trainDF['text'], trainDF['label'])
# label encode the target variable
encoder = preprocessing.LabelEncoder()
train_y = encoder.fit_transform(train_y)
valid_y = encoder.fit_transform(valid_y)
# create a count vectorizer object
count_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')
count_vect.fit(trainDF['text'])
The data file contains data like this:
0 #\xdaltimahora Es tracta d'un aparell de Germanwings amb 152 passatgers a bord
0 Route map now being shared by http:
0 Pray for #4U9525 http:
0 Airbus A320 #4U9525 crash: \nFlight tracking data here: \nhttp
Error:
Traceback:
"C:\Program Files\Python36\python.exe" "C:/Users/HS/PycharmProjects/R/C/Text classification1.py"
Using TensorFlow backend.
Traceback (most recent call last):
File "C:/Users/HS/PycharmProjects/R/C/Text classification1.py", line 38, in <module>
count_vect.fit(trainDF['text'])
File "C:\Program Files\Python36\lib\site-packages\sklearn\feature_extraction\text.py", line 836, in fit
self.fit_transform(raw_documents)
File "C:\Program Files\Python36\lib\site-packages\sklearn\feature_extraction\text.py", line 869, in fit_transform
self.fixed_vocabulary_)
File "C:\Program Files\Python36\lib\site-packages\sklearn\feature_extraction\text.py", line 792, in _count_vocab
for feature in analyze(doc):
File "C:\Program Files\Python36\lib\site-packages\sklearn\feature_extraction\text.py", line 266, in <lambda>
tokenize(preprocess(self.decode(doc))), stop_words)
File "C:\Program Files\Python36\lib\site-packages\sklearn\feature_extraction\text.py", line 232, in <lambda>
return lambda x: strip_accents(x.lower())
AttributeError: 'list' object has no attribute 'lower'
Process finished with exit code 1
From the documentation:
fit(raw_documents, y=None)
Learn a vocabulary dictionary of all tokens in the raw documents.
Parameters: raw_documents : iterable
An iterable which yields either str, unicode or file objects.
Returns: self
You get the error AttributeError: 'list' object has no attribute 'lower' because you gave it an iterable (in this case a pd.Series) of list objects, instead of an iterable of strings.
You should be able to fix this by using texts.append(' '.join(content[1:]))
instead of texts.append(content[1:]):
for i, line in enumerate(data.split("\n")):
    content = line.split()
    #print(content)
    if len(content) is not 0:
        labels.append(content[0])
        #texts.append(content[1:])
        texts.append(' '.join(content[1:]))
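For illustration, fit succeeds once every document is a single string rather than a list of tokens (a small self-contained example, not the asker's data):

from sklearn.feature_extraction.text import CountVectorizer

docs_as_lists = [['Pray', 'for', '#4U9525'], ['Route', 'map', 'now', 'being', 'shared']]
docs_as_strings = [' '.join(tokens) for tokens in docs_as_lists]   # what fit() expects

count_vect = CountVectorizer(analyzer='word', token_pattern=r'\w{1,}')
count_vect.fit(docs_as_strings)            # works: each document is a str
print(sorted(count_vect.vocabulary_))      # ['4u9525', 'being', 'for', 'map', 'now', 'pray', 'route', 'shared']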

Spark java.lang.StackOverflowError: logistic regression fit with large dataset

I am trying to fit a logistic regression model for a data set with 470 features and 10 million training instances. Here is a snippet of my code.
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import RFormula
formula = RFormula(formula = "label ~ .-classWeight")
bestregLambdaVal = 0.005
bestregAlphaVal = 0.01
lr = LogisticRegression(maxIter=1000, regParam=bestregLambdaVal, elasticNetParam=bestregAlphaVal,weightCol="classWeight")
pipeLineLr = Pipeline(stages = [formula, lr])
pipeLineFit = pipeLineLr.fit(mySparkDataFrame[featureColumnNameList + ['classWeight','label']])
I have also created a checkpoint directory,
sc.setCheckpointDir('checkpoint/')
as suggested here:
Spark gives a StackOverflowError when training using ALS
However I get an error and here is a partial trace:
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/base.py", line 64, in fit
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/pipeline.py", line 108, in _fit
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/base.py", line 64, in fit
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py", line 265, in _fit
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py", line 262, in _fit_java
File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o383361.fit.
: java.lang.StackOverflowError
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
at scala.collection.immutable.List$SerializationProxy.writeObject(List.scala:468)
at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source)
I would also like to note that the 470 feature columns were added to the Spark DataFrame iteratively using withColumn().
The mistake I was making was that, when checkpointing the DataFrame, I would only do:
mySparkDataFrame.checkpoint(eager=True)
The right way to do it is:
mySparkDataFrame = mySparkDataFrame.checkpoint(eager=True)
This is based on another question I had asked (and got an answer for) here:
pyspark rdd isCheckPointed() is false
Also, it is recommended to persist() the DataFrame before the checkpoint and to count() it after the checkpoint, as in the sketch below.
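Put together, the checkpointing step would look something like this sketch (a small stand-in DataFrame replaces the real 470-column one):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.sparkContext.setCheckpointDir('checkpoint/')

# Hypothetical small frame standing in for mySparkDataFrame.
mySparkDataFrame = spark.range(10).withColumnRenamed('id', 'label')

mySparkDataFrame = mySparkDataFrame.persist()                # keep the data around for the checkpoint write
mySparkDataFrame = mySparkDataFrame.checkpoint(eager=True)   # returns a NEW DataFrame; reassignment is the key step
mySparkDataFrame.count()                                     # materializes the checkpointed lineage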
