Tensorflow Object detection training job fails on Google cloud

My Google Cloud Storage bucket is laid out as follows:
-data
--labels.pbtxt
--train.record
--test.record
-training
--config file
--packages
My local machine has the data under /tensorflow/models/research/object_detection in the same layout, plus:
-training
--cloud.yml
I'm running the following command to start the job on Google Cloud ML Engine:
gcloud ml-engine jobs submit training object_detection_0.1 \
    --job-dir=gs://{BUCKET NAME}/training \
    --packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz \
    --module-name object_detection.train \
    --region us-central1 \
    --config /##/##/models/research/object_detection/training \
    -- \
    --train_dir=gs://{BUCKET NAME}/training \
    --pipeline_config_path=gs://{BUCKET NAME}/training/config_file.config
The Google Cloud logs show the following error:
Traceback (most recent call last):
File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
"__main__", fname, loader, pkg_name)
File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 49, in <module>
from object_detection import trainer
File "/root/.local/lib/python2.7/site-packages/object_detection/trainer.py", line 33, in <module>
from deployment import model_deploy
ImportError: No module named deployment
Replica workers 0, 1, 2, 3: same error.
The replica worker 4 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
"__main__", fname, loader, pkg_name)
File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 49, in <module>
from object_detection import trainer
File "/root/.local/lib/python2.7/site-packages/object_detection/trainer.py", line 33, in <module>
from deployment import model_deploy
ImportError: No module named deployment
Replica ps 0, 1: same error.
The replica ps 2 exited with a non-zero status of 1. Termination reason: Error.
Traceback (most recent call last):
File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
"__main__", fname, loader, pkg_name)
File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 49, in <module>
from object_detection import trainer
File "/root/.local/lib/python2.7/site-packages/object_detection/trainer.py", line 33, in <module>
from deployment import model_deploy
ImportError: No module named deployment

I am having the same problem with the deeplab model. It seems the import refers to the deployment folder, because it works for me if I place that folder where it can be found when it is called.
By the way, let me know how you solved it.
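
One way to narrow this down before resubmitting the job is to check whether any of the tarballs passed to --packages actually contains the deployment package. The sketch below is only a diagnostic under assumptions: the helper name archive_has_module is made up for illustration, and it assumes that deployment/model_deploy.py is expected to ship inside one of the uploaded archives (most likely the slim one, but that is a guess from the question's layout).

```python
import tarfile


def archive_has_module(archive_path, module_suffix):
    """Return True if any member of the .tar.gz archive ends with module_suffix.

    archive_has_module is a hypothetical helper, not part of gcloud or TensorFlow.
    """
    with tarfile.open(archive_path, "r:gz") as archive:
        return any(name.endswith(module_suffix) for name in archive.getnames())
```

For example, calling archive_has_module("slim/dist/slim-0.1.tar.gz", "deployment/model_deploy.py") on the archive from the question shows whether the module the workers fail to import was packaged at all. If it returns False for every archive, the ImportError on the workers is expected.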

Related

AWS Sagemaker KeyError: 'SM_CHANNEL_TRAINING' when tuning hyperparameters

When I try to use hyperparameter tuning on SageMaker, I get this error:
UnexpectedStatusException: Error for HyperParameterTuning job imageclassif-job-10-21-47-43: Failed. Reason: No training job succeeded after 5 attempts. Please take a look at the training job failures to get more details.
When I look up the logs on CloudWatch, all 5 failed training jobs have the same error at the end:
Traceback (most recent call last):
File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
"__main__", mod_spec)
File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/opt/ml/code/train.py", line 117, in <module>
parser.add_argument('--data-dir', type=str, default=os.environ['SM_CHANNEL_TRAINING'])
File "/usr/lib/python3.5/os.py", line 725, in __getitem__
raise KeyError(key) from None
and
KeyError: 'SM_CHANNEL_TRAINING'
The problem is in Step 4 of the project: https://github.com/petrooha/Deploying-LSTM/blob/main/SageMaker%20Project.ipynb
I would highly appreciate any hints on where to look next.

In your train.py file, changing the environment variable from
parser.add_argument('--data-dir', type=str, default=os.environ['SM_CHANNEL_TRAINING'])
to
parser.add_argument('--data-dir', type=str, default=os.environ['SM_CHANNEL_TRAIN']) should address the issue.
This is the case with Torch's framework_version 1.3.1 but other versions might also be affected. Here is the link for your reference.
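
A more defensive variant is to avoid indexing os.environ directly, so a missing variable does not crash train.py at import time. As far as I know, SageMaker derives each SM_CHANNEL_<NAME> variable from the channel name passed to fit(), which is why the name can differ between setups. The sketch below is an assumption-laden illustration: default_data_dir and build_parser are hypothetical helpers, and the list of candidate channel names is only a guess based on the two names mentioned in this thread.

```python
import argparse
import os


def default_data_dir(environ, channel_names=("TRAINING", "TRAIN")):
    """Return the first SM_CHANNEL_<NAME> value that is set, or None.

    The candidate names are assumptions taken from this thread, not an
    exhaustive list of what SageMaker may define.
    """
    for name in channel_names:
        value = environ.get("SM_CHANNEL_" + name)
        if value:
            return value
    return None


def build_parser(environ=os.environ):
    """Build an argument parser whose --data-dir default never raises KeyError."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--data-dir', type=str, default=default_data_dir(environ))
    return parser
```

With this, a job that only defines SM_CHANNEL_TRAIN still gets a usable default, and a job that defines neither ends up with data_dir set to None, which the script can report with a clearer message than a KeyError.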

Spyder: An error ocurred while starting the kernel [duplicate]

I have a problem with Spyder (version 4.0.1): kernels fail to start in the IPython Console. I have tried many ways to resolve the issue, such as running commands in the Anaconda prompt and resetting the settings to their defaults. I even updated Anaconda and Spyder. Nevertheless, nothing has changed and the issue still exists.
This is the error I am receiving:
Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "C:\ProgramData\Anaconda3\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "C:\ProgramData\Anaconda3\lib\site-packages\spyder_kernels\console\__main__.py", line 11, in <module>
start.main()
File "C:\ProgramData\Anaconda3\lib\site-packages\spyder_kernels\console\start.py", line 287, in main
import_spydercustomize()
File "C:\ProgramData\Anaconda3\lib\site-packages\spyder_kernels\console\start.py", line 39, in import_spydercustomize
import spydercustomize
File "C:\ProgramData\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 24, in <module>
from IPython.core.getipython import get_ipython
File "C:\ProgramData\Anaconda3\lib\site-packages\IPython\__init__.py", line 56, in <module>
from .terminal.embed import embed
File "C:\ProgramData\Anaconda3\lib\site-packages\IPython\terminal\embed.py", line 14, in <module>
from IPython.core.magic import Magics, magics_class, line_magic
File "C:\ProgramData\Anaconda3\lib\site-packages\IPython\core\magic.py", line 20, in <module>
from . import oinspect
File "C:\ProgramData\Anaconda3\lib\site-packages\IPython\core\oinspect.py", line 30, in <module>
from IPython.lib.pretty import pretty
File "C:\ProgramData\Anaconda3\lib\site-packages\IPython\lib\pretty.py", line 82, in <module>
import datetime
File "C:\Users\mahkam\datetime.py", line 4
^
SyntaxError: EOF while scanning triple-quoted string literal

(Spyder maintainer here) You need to rename or remove this file:
C:\Users\mahkam\datetime.py
That's because the file has the same name as a module in Python's standard library, so other modules that depend on datetime import your file instead and break.
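
To check whether other files in the same folder shadow standard-library modules in the same way, a small scan like the following can help. This is a sketch under assumptions: find_shadowing_files is a hypothetical helper, and it relies on sys.stdlib_module_names, which only exists on Python 3.10 and later. On older versions it falls back to checking just datetime.

```python
import os
import sys


def find_shadowing_files(directory, module_names=None):
    """List .py files in directory whose name matches a standard-library module.

    find_shadowing_files is a hypothetical helper written for this thread.
    """
    if module_names is None:
        # sys.stdlib_module_names is available on Python 3.10+ only;
        # the fallback covers just the module from this question.
        module_names = getattr(sys, "stdlib_module_names", frozenset({"datetime"}))
    hits = []
    for entry in sorted(os.listdir(directory)):
        stem, ext = os.path.splitext(entry)
        if ext == ".py" and stem in module_names:
            hits.append(os.path.join(directory, entry))
    return hits
```

Running it on the home folder from the traceback would be expected to report the datetime.py file there. Any file it lists is a candidate for renaming.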

It looks like you have a quoting error:
File "C:\Users\mahkam\datetime.py", line 4 ^ SyntaxError: EOF while scanning triple-quoted string literal
Check your datetime.py for a triple-quoted string that is never closed.

Can't start docker `OSError: [Errno 8] Exec format error: '/usr/local/bin/docker-credential-ecr-login'`

I want to start my docker-compose and I always get this error.
Docker Desktop tells me I'm logged in. I also rebooted once and logged in again.
I don't quite understand why that's not possible. If I pull other Docker Containers in another project, everything works.
We don't use Python in our project.
$ docker --version
Docker version 19.03.8, build afacb8b
$ docker-compose --version
docker-compose version 1.25.4, build 8d51620a
$ python --version
Python 3.7.4
macOS Catalina 10.15.3
Here is the stack trace:
> docker-compose up
Pulling mongo (mongo:latest)...
Traceback (most recent call last):
File "site-packages/docker/credentials/store.py", line 80, in _execute
File "subprocess.py", line 411, in check_output
File "subprocess.py", line 488, in run
File "subprocess.py", line 800, in __init__
File "subprocess.py", line 1551, in _execute_child
OSError: [Errno 8] Exec format error: '/usr/local/bin/docker-credential-ecr-login'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "site-packages/docker/auth.py", line 264, in _resolve_authconfig_credstore
File "site-packages/docker/credentials/store.py", line 35, in get
File "site-packages/docker/credentials/store.py", line 104, in _execute
docker.credentials.errors.StoreError: Unexpected OS error "Exec format error", errno=8
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "docker-compose", line 6, in <module>
File "compose/cli/main.py", line 72, in main
File "compose/cli/main.py", line 128, in perform_command
File "compose/cli/main.py", line 1077, in up
File "compose/cli/main.py", line 1073, in up
File "compose/project.py", line 548, in up
File "compose/service.py", line 361, in ensure_image_exists
File "compose/service.py", line 1250, in pull
File "compose/progress_stream.py", line 102, in get_digest_from_pull
File "compose/service.py", line 1215, in _do_pull
File "site-packages/docker/api/image.py", line 396, in pull
File "site-packages/docker/auth.py", line 48, in get_config_header
File "site-packages/docker/auth.py", line 324, in resolve_authconfig
File "site-packages/docker/auth.py", line 235, in resolve_authconfig
File "site-packages/docker/auth.py", line 281, in _resolve_authconfig_credstore
docker.errors.DockerException: Credentials store error: StoreError('Unexpected OS error "Exec format error", errno=8')
[52557] Failed to execute script docker-compose

Resetting Docker in docker-hub/settings solved the problem.
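
For anyone who wants to see why a credential helper is being invoked at all: the traceback shows docker-compose failing while executing docker-credential-ecr-login. Docker clients choose credential helpers from the credsStore and credHelpers keys of the client config file (usually ~/.docker/config.json), and the helper binary is named docker-credential- followed by the configured value. The sketch below only lists those binaries. The function name is hypothetical, and whether the reset above changed this configuration is an assumption, not something stated in the thread.

```python
import json


def credential_helper_binaries(config_text):
    """Return the helper binary names referenced by a Docker client config.

    config_text is the JSON content of the config file, passed as a string.
    """
    config = json.loads(config_text)
    # Per-registry helpers, e.g. {"<registry host>": "ecr-login"}.
    names = set(config.get("credHelpers", {}).values())
    # Default store used for every registry without a specific helper.
    if config.get("credsStore"):
        names.add(config["credsStore"])
    return sorted("docker-credential-" + name for name in names)
```

If docker-credential-ecr-login shows up in the result, the binary at the path from the error message is worth checking, since an "Exec format error" usually means the file is not a valid executable for the current platform.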

KeyError when deploying Python Function App on Azure

I'm new to Azure and I'm getting this KeyError when deploying my Python function on the Azure portal. I'm not sure what the reason is.
I have added just one package, "tweepy == 3.8.0", to my requirements.txt, and the deployment seems to crash during its installation. The PySocks package is probably just a dependency of tweepy.
I have no such issues when I debug it locally. The function runs absolutely fine locally.
How can I resolve this deployment issue?
Error:
There was an error restoring dependencies. Traceback (most recent call last):
File "C:\Users\anjan\AppData\Local\Programs\Python\Python36\lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "C:\Users\anjan\AppData\Local\Programs\Python\Python36\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "C:\Users\anjan\AppData\Roaming\npm\node_modules\azure-functions-core-tools\bin\tools\python\packapp\__main__.py", line 234, in <module>
main()
File "C:\Users\anjan\AppData\Roaming\npm\node_modules\azure-functions-core-tools\bin\tools\python\packapp\__main__.py", line 60, in main
find_and_build_deps(args)
File "C:\Users\anjan\AppData\Roaming\npm\node_modules\azure-functions-core-tools\bin\tools\python\packapp\__main__.py", line 142, in find_and_build_deps
wheel.install(paths, maker)
File "C:\Users\anjan\AppData\Roaming\npm\node_modules\azure-functions-core-tools\bin\tools\python\packapp\distlib\wheel.py", line 519, in install
row = records[u_arcname]
KeyError: 'PySocks-1.7.0.dist-info/'

"func: pack" task has been a common problem for users. I could solve it by trying a preview feature that is meant to address this: https://github.com/microsoft/vscode-azurefunctions/wiki/Server-Side-Build

vagrant dcos install cassandra error

I followed the official vagrant-dcos instructions to install Cassandra with a minimal setup by running the command below, and got errors. Any ideas?
dcos package install --options=examples/oinker/pkg-cassandra.json cassandra --yes
See the error below:
Traceback (most recent call last):
File "cli/dcoscli/subcommand.py", line 101, in run_and_capture
File "cli/dcoscli/package/main.py", line 22, in main
File "cli/dcoscli/util.py", line 22, in wrapper
File "cli/dcoscli/package/main.py", line 36, in _main
File "dcos/cmds.py", line 43, in execute
File "cli/dcoscli/package/main.py", line 322, in _install
File "dcos/packagemanager.py", line 177, in get_package_version
File "dcos/packagemanager.py", line 359, in __init__
File "cli/env/lib/python3.5/site-packages/requests/models.py", line 866, in json
File "json/__init__.py", line 319, in loads
File "json/decoder.py", line 339, in decode
File "json/decoder.py", line 357, in raw_decode
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

It seems to be an issue with Cosmos. I restarted the whole Vagrant environment and it worked fine.
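
For context, "Expecting value: line 1 column 1 (char 0)" is what Python's json module raises when the text it is given does not start with valid JSON, for example an empty response body or an HTML error page. That would be consistent with the package service not answering properly, although the thread does not confirm it. The sketch below shows the behaviour with a hypothetical helper, parse_json_body, that reports the problem instead of raising.

```python
import json


def parse_json_body(body):
    """Return (data, None) on success, or (None, message) if body is not JSON.

    parse_json_body is a hypothetical helper, not part of the dcos CLI.
    """
    try:
        return json.loads(body), None
    except json.JSONDecodeError as error:
        # Include the start of the body so an empty or HTML response is obvious.
        return None, "not JSON ({}); body starts with {!r}".format(error, body[:80])
```

Passing an empty string reproduces the exact message from the traceback inside the returned error text, while a valid JSON document is returned as parsed data with no error.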
