AWS SageMaker KeyError: 'SM_CHANNEL_TRAINING' when tuning hyperparameters - python-3.x

When I try to use hyperparameter tuning on SageMaker, I get this error:
UnexpectedStatusException: Error for HyperParameterTuning job imageclassif-job-10-21-47-43: Failed. Reason: No training job succeeded after 5 attempts. Please take a look at the training job failures to get more details.
When I look up the logs on CloudWatch, all 5 failed training jobs have the same error at the end:
Traceback (most recent call last):
File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/opt/ml/code/train.py", line 117, in <module>
parser.add_argument('--data-dir', type=str, default=os.environ['SM_CHANNEL_TRAINING'])
File "/usr/lib/python3.5/os.py", line 725, in __getitem__
raise KeyError(key) from None
KeyError: 'SM_CHANNEL_TRAINING'
The problem occurs at Step 4 of the project: https://github.com/petrooha/Deploying-LSTM/blob/main/SageMaker%20Project.ipynb
Would highly appreciate any hints on where to look next.

In your train.py file, changing the environment variable from
parser.add_argument('--data-dir', type=str, default=os.environ['SM_CHANNEL_TRAINING'])
to
parser.add_argument('--data-dir', type=str, default=os.environ['SM_CHANNEL_TRAIN'])
should address the issue.
This is the case with PyTorch's framework_version 1.3.1, but other versions might also be affected. Here is the link for your reference.
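If you want train.py to be robust to either channel name, a defensive fallback is also possible. A minimal sketch, assuming the channel is mounted at SageMaker's standard path (the fallback path below is an assumption; adjust it to your channel name):
import argparse
import os

parser = argparse.ArgumentParser()
# Prefer SM_CHANNEL_TRAIN (channel named 'train'), fall back to
# SM_CHANNEL_TRAINING (channel named 'training'), then to the standard
# mount point as a last resort.
parser.add_argument('--data-dir', type=str,
                    default=os.environ.get('SM_CHANNEL_TRAIN',
                            os.environ.get('SM_CHANNEL_TRAINING',
                                           '/opt/ml/input/data/training')))
args = parser.parse_args()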

Related

How to solve "ValueError: Cannot create group in read only mode" when loading a YOLO model?

I'm writing a GUI application with wxPython. The application uses YOLO to detect pavement breakage, and I use the YOLO code for both training and detection. Loading the YOLO model is too time-consuming, so the GUI freezes while it loads. I would therefore like to show a progress bar during model loading, using threading.Thread. I can load the YOLO model from the main thread, but I get an exception when loading it from a new thread.
The error:
Traceback (most recent call last):
File "C:\Program Files\Python36\lib\contextlib.py", line 99, in __exit__
self.gen.throw(type, value, traceback)
File "C:\Users\JH-06\AppData\Roaming\Python\Python36\site-packages\tensorflow\python\framework\ops.py", line 5652, in get_controller
yield g
File "d:\code\Python\yoloDetector_v007\src\YOLO\yolo.py", line 76, in generate
self.yolo_model = load_model(model_path, compile=False)
File "C:\Program Files\Python36\lib\site-packages\keras\engine\saving.py", line 419, in load_model
model = _deserialize_model(f, custom_objects, compile)
File "C:\Program Files\Python36\lib\site-packages\keras\engine\saving.py", line 221, in _deserialize_model
model_config = f['model_config']
File "C:\Program Files\Python36\lib\site-packages\keras\utils\io_utils.py", line 302, in __getitem__
raise ValueError('Cannot create group in read only mode.')
ValueError: Cannot create group in read only mode.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "d:\code\Python\yoloDetector_v007\src\myRoadDamageUtil\myRoadDetectionModel.py", line 166, in init
self.__m_oVideoDetector.init()
File "d:\code\Python\yoloDetector_v007\src\myRoadDamageUtil\myVideoDetector.py", line 130, in init
self.__m_oDetector.init()
File "d:\code\Python\yoloDetector_v007\src\myRoadDamageUtil\myRoadBreakageDetector.py", line 87, in init
self.__m_oYoloDetector.init()
File "d:\code\Python\yoloDetector_v007\src\YOLO\yolo.py", line 46, in init
self.boxes, self.scores, self.classes = self.generate()
File "d:\code\Python\yoloDetector_v007\src\YOLO\yolo.py", line 80, in generate
self.yolo_model.load_weights(self.model_path) # make sure model, anchors and classes match
File "C:\Program Files\Python36\lib\site-packages\keras\engine\network.py", line 1166, in load_weights
f, self.layers, reshape=reshape)
File "C:\Program Files\Python36\lib\site-packages\keras\engine\saving.py", line 1058, in load_weights_from_hdf5_group
K.batch_set_value(weight_value_tuples)
File "C:\Program Files\Python36\lib\site-packages\keras\backend\tensorflow_backend.py", line 2470, in batch_set_value
get_session().run(assign_ops, feed_dict=feed_dict)
File "C:\Users\JH-06\AppData\Roaming\Python\Python36\site-packages\tensorflow\python\client\session.py", line 950, in run
run_metadata_ptr)
File "C:\Users\JH-06\AppData\Roaming\Python\Python36\site-packages\tensorflow\python\client\session.py", line 1098, in _run
raise RuntimeError('The Session graph is empty. Add operations to the '
RuntimeError: The Session graph is empty. Add operations to the graph before calling run().
Could somebody give me a suggestion?
When using wxPython with threads, you need to make sure that you are using a thread-safe method to communicate back to the GUI. There are 3 thread-safe methods you can use with wxPython:
wx.CallAfter
wx.CallLater
wx.PostEvent
Check out either of the following articles for more information:
https://www.blog.pythonlibrary.org/2010/05/22/wxpython-and-threads/
https://wiki.wxpython.org/LongRunningTasks
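As a minimal sketch of the pattern (load_yolo_model and on_model_loaded are hypothetical stand-ins for your own loading code and GUI callback, and a wx.App event loop is assumed to be running):
import threading
import time

import wx

def load_yolo_model():
    # Hypothetical stand-in for the real, slow Keras/TF model load.
    time.sleep(2)
    return "yolo-model"

def on_model_loaded(model):
    # Runs on the main GUI thread: hide the progress bar, store the model, etc.
    print("model ready:", model)

def load_in_background():
    # The heavy work happens on the worker thread...
    model = load_yolo_model()
    # ...but the result is handed back to the GUI thread safely.
    wx.CallAfter(on_model_loaded, model)

threading.Thread(target=load_in_background, daemon=True).start()
Note also that with TF1-style Keras the default graph is thread-local, which is what the "Session graph is empty" error is complaining about; a common companion fix is to capture the graph in the main thread (graph = tf.get_default_graph()) and wrap the worker's model calls in with graph.as_default():.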

Unable to Start Scheduler

I am new to Python and am trying to install Airflow on my Mac by following this tutorial.
While these two commands work fine:
$ airflow initdb
$ airflow webserver -p 8080
The scheduler command (airflow scheduler) throws the following error:
[2020-02-18 13:18:09,012] {scheduler_job.py:1382} ERROR - Exception when executing execute_helper
Traceback (most recent call last):
File "/Users/mac/Workspace/airflow/airflow_venv/lib/python3.8/site-packages/airflow/jobs/scheduler_job.py", line 1380, in _execute
self._execute_helper()
File "/Users/mac/Workspace/airflow/airflow_venv/lib/python3.8/site-packages/airflow/jobs/scheduler_job.py", line 1413, in _execute_helper
self.processor_agent.start()
File "/Users/mac/Workspace/airflow/airflow_venv/lib/python3.8/site-packages/airflow/utils/dag_processing.py", line 554, in start
self._process.start()
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/process.py", line 121, in start
self._popen = self._Popen(self)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/context.py", line 224, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/context.py", line 283, in _Popen
return Popen(process_obj)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 32, in __init__
super().__init__(process_obj)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/popen_fork.py", line 19, in __init__
self._launch(process_obj)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/popen_spawn_posix.py", line 47, in _launch
reduction.dump(process_obj, fp)
File "/Library/Frameworks/Python.framework/Versions/3.8/lib/python3.8/multiprocessing/reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
AttributeError: Can't pickle local object 'SchedulerJob._execute.<locals>.processor_factory'
[2020-02-18 13:18:09,035] {helpers.py:322} INFO - Sending Signals.SIGTERM to GPID None
Traceback (most recent call last):
File "/Users/mac/Workspace/airflow/airflow_venv/bin/airflow", line 37, in <module>
args.func(args)
File "/Users/mac/Workspace/airflow/airflow_venv/lib/python3.8/site-packages/airflow/utils/cli.py", line 75, in wrapper
return f(*args, **kwargs)
File "/Users/mac/Workspace/airflow/airflow_venv/lib/python3.8/site-packages/airflow/bin/cli.py", line 1040, in scheduler
job.run()
File "/Users/mac/Workspace/airflow/airflow_venv/lib/python3.8/site-packages/airflow/jobs/base_job.py", line 221, in run
self._execute()
File "/Users/mac/Workspace/airflow/airflow_venv/lib/python3.8/site-packages/airflow/jobs/scheduler_job.py", line 1384, in _execute
self.processor_agent.end()
File "/Users/mac/Workspace/airflow/airflow_venv/lib/python3.8/site-packages/airflow/utils/dag_processing.py", line 707, in end
reap_process_group(self._process.pid, log=self.log)
File "/Users/mac/Workspace/airflow/airflow_venv/lib/python3.8/site-packages/airflow/utils/helpers.py", line 324, in reap_process_group
signal_procs(sig)
File "/Users/mac/Workspace/airflow/airflow_venv/lib/python3.8/site-packages/airflow/utils/helpers.py", line 293, in signal_procs
os.killpg(pgid, sig)
TypeError: an integer is required (got type NoneType)
EDIT: Python 3.8 is supported now (https://github.com/apache/airflow#requirements), so this answer may no longer be relevant.
This is due to the Python version you are using. Airflow doesn't support Python 3.8 yet (https://github.com/apache/airflow#stable-version-1109).
Downgrade your Python to 3.7 and check.
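The underlying failure is visible in the traceback: on macOS, Python 3.8 switched the default multiprocessing start method from fork to spawn, and spawn pickles the process target; a local (nested) function such as Airflow's processor_factory cannot be pickled. A minimal sketch reproducing the same AttributeError:
import multiprocessing

def run():
    def processor_factory():  # a local object, like the one in SchedulerJob._execute
        pass
    p = multiprocessing.Process(target=processor_factory)
    p.start()  # AttributeError: Can't pickle local object 'run.<locals>.processor_factory'

if __name__ == '__main__':
    multiprocessing.set_start_method('spawn')  # the default on macOS since Python 3.8
    run()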
Maybe there are some compatibility problems? Using Python 3.6.10 and Airflow v1.10.4, I can get Airflow running. Maybe you could try some other versions?
This worked for me!
1 - Make sure you are using a Celery version that supports your other packages, such as RabbitMQ (v5 doesn't support AMQP in its usual format); my advice is to use v4.6.x.
2 - This has nothing to do with the Python version if you are using Airflow v2.0.
3 - Simply make yourself happy with airflow db reset (the command may differ if you are using an Airflow version < 2.0).
4 - Avoid deleting a DAG the way you would delete a file; use the airflow dag ... commands to do so. (Anything else makes a mess in your environment that you won't like, trust me on this.)
Wish you luck bearing with Python stuff.

How to port a tf.Session to a tf.train.MonitoredSession call while allowing graph modifications

The code I'm working on is this.
The code uses a tf.Session call to take in a graph for object detection tasks. Link
My aim here is to profile this code for Nvidia GPUs using nvtx-plugins-tf to analyze the time taken by different ops. Link to docs
The plugin library provides a function hook for a tf.train.MonitoredSession, as shown in their example code here.
The code linked above uses tf.Session along with a session config (tf.ConfigProto), and when I try to change the tf.Session call to a tf.train.MonitoredSession call, I can't get my code to work; it fails with an error that the graph can't be modified. I went through the TensorFlow APIs, and it turns out that tf.Session doesn't support hook callbacks and tf.train.MonitoredSession doesn't accept a session config as a function argument.
Traceback (most recent call last):
File "/home/mayroy13/anaconda3/envs/trt-py36/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/mayroy13/anaconda3/envs/trt-py36/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/mayroy13/Mayank/Mayank/test/tensorrt/tftrt/examples/object_detection/test.py", line 105, in <module>
test(args.test_config_path)
File "/home/mayroy13/Mayank/Mayank/test/tensorrt/tftrt/examples/object_detection/test.py", line 81, in test
**test_config['benchmark_config'])
File "/home/mayroy13/Mayank/Mayank/test/tensorrt/tftrt/examples/object_detection/object_detection.py", line 608, in benchmark_model
tf.import_graph_def(frozen_graph, name='')
File "/home/mayroy13/anaconda3/envs/trt-py36/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/home/mayroy13/anaconda3/envs/trt-py36/lib/python3.6/site-packages/tensorflow/python/framework/importer.py", line 443, in import_graph_def
_ProcessNewOps(graph)
File "/home/mayroy13/anaconda3/envs/trt-py36/lib/python3.6/site-packages/tensorflow/python/framework/importer.py", line 236, in _ProcessNewOps
for new_op in graph._add_new_tf_operations(compute_devices=False): # pylint: disable=protected-access
File "/home/mayroy13/anaconda3/envs/trt-py36/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3751, in _add_new_tf_operations
for c_op in c_api_util.new_tf_operations(self)
File "/home/mayroy13/anaconda3/envs/trt-py36/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3751, in <listcomp>
for c_op in c_api_util.new_tf_operations(self)
File "/home/mayroy13/anaconda3/envs/trt-py36/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3640, in _create_op_from_tf_operation
self._check_not_finalized()
File "/home/mayroy13/anaconda3/envs/trt-py36/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3225, in _check_not_finalized
raise RuntimeError("Graph is finalized and cannot be modified.")
RuntimeError: Graph is finalized and cannot be modified.
Any directions to go in would be appreciated. If there are ways in tensorflow to use hooks in conjunction with tf.session, that will also work for me.
The reason might be that MonitoredTrainingSession finalizes (freezes) the graph. You may need to reinitialize the graph on each loop iteration, creating a new default graph at the top of the loop:
import tensorflow as tf

tf.reset_default_graph()       # discard the old (finalized) default graph
with tf.Graph().as_default():  # as_default() is meant to be used as a context manager
    pass                       # build or import your graph here
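Alternatively, since MonitoredSession finalizes the graph when it is constructed, importing the frozen graph before creating the session lets you keep both the hooks and a session config. A rough sketch under that assumption (frozen_graph_def, nvtx_hook, and fetches are placeholders for the objects in the linked code):
import tensorflow as tf

with tf.Graph().as_default():
    # Import everything BEFORE constructing the session; once the
    # MonitoredSession exists, the graph is finalized and any further
    # tf.import_graph_def call raises "Graph is finalized".
    tf.import_graph_def(frozen_graph_def, name='')
    config = tf.ConfigProto(allow_soft_placement=True)
    creator = tf.train.ChiefSessionCreator(config=config)  # carries the session config
    with tf.train.MonitoredSession(session_creator=creator,
                                   hooks=[nvtx_hook]) as sess:
        sess.run(fetches)  # fetches: whatever tensors your benchmark evaluates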

KeyError when deploying Python Function App on Azure

I'm new to Azure, and I'm getting this KeyError when deploying my Python function on the Azure portal; I'm not sure what the reason is.
I have added just one package, "tweepy == 3.8.0", to my requirements.txt, and the deployment seems to crash during its installation. The PySocks package is probably just a dependency of the tweepy package.
I have no such issues when I debug it locally; the function runs absolutely fine locally.
How can I resolve this deployment issue?
Error:
There was an error restoring dependencies. Traceback (most recent call last):
File "C:\Users\anjan\AppData\Local\Programs\Python\Python36\lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "C:\Users\anjan\AppData\Local\Programs\Python\Python36\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "C:\Users\anjan\AppData\Roaming\npm\node_modules\azure-functions-core-tools\bin\tools\python\packapp\__main__.py", line
234, in <module>
main()
File "C:\Users\anjan\AppData\Roaming\npm\node_modules\azure-functions-core-tools\bin\tools\python\packapp\__main__.py", line
60, in main
find_and_build_deps(args)
File "C:\Users\anjan\AppData\Roaming\npm\node_modules\azure-functions-core-tools\bin\tools\python\packapp\__main__.py", line
142, in find_and_build_deps
wheel.install(paths, maker)
File "C:\Users\anjan\AppData\Roaming\npm\node_modules\azure-functions-core-tools\bin\tools\python\packapp\distlib\wheel.py",
line 519, in install
row = records[u_arcname]
KeyError: 'PySocks-1.7.0.dist-info/'
"func: pack" task has been a common problem for users. I could solve it by trying a preview feature that is meant to address this: https://github.com/microsoft/vscode-azurefunctions/wiki/Server-Side-Build

TensorFlow Object Detection training job fails on Google Cloud

My Google Cloud Storage bucket is organized in the following manner:
-data
--labels.pbtxt
--train.record
--test.record
-training
--config file
--packages
My local machine has the data in /tensorflow/models/research/object_detection in the same manner, with the addition of:
-training
--cloud.yml
I'm running the following command to start the job on Google Cloud ML Engine:
gcloud ml-engine jobs submit training object_detection_0.1 \
    --job-dir=gs://{BUCKET NAME}/training \
    --packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz \
    --module-name object_detection.train \
    --region us-central1 \
    --config /##/##/models/research/object_detection/training \
    -- \
    --train_dir=gs://{BUCKET NAME}/training \
    --pipeline_config_path=gs://{BUCKET NAME}/training/config_file.config
Google cloud logs show me the following error.
Traceback (most recent call last):
File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
"__main__", fname, loader, pkg_name)
File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 49, in <module>
from object_detection import trainer
File "/root/.local/lib/python2.7/site-packages/object_detection/trainer.py", line 33, in <module>
from deployment import model_deploy
ImportError: No module named deployment
replica worker 0,1,2,3 - same error
The replica worker 4 exited with a non-zero status of 1. Termination reason: Error.
replica ps 0,1 - same error
The replica ps 2 exited with a non-zero status of 1. Termination reason: Error.
(The same traceback was reported for every replica.)
I am having the same problem with the deeplab model. It seems the import refers to this folder; it works for me if I place it where it can be imported properly.
By the way, please let me know how you solved it.
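For what it's worth, "No module named deployment" usually means the slim package (which contains the deployment module) wasn't built or wasn't included in --packages. A sketch of the packaging steps from the TensorFlow Object Detection API's cloud instructions, run from tensorflow/models/research before submitting the job:
# Build the two source distributions that the --packages flag expects.
python setup.py sdist
(cd slim && python setup.py sdist)
This produces dist/object_detection-0.1.tar.gz and slim/dist/slim-0.1.tar.gz, matching the paths in the gcloud command above.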
