SageMaker: continual fit error after one failure to fit

I am implementing some Sagemaker SKLearn examples:
https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-python-sdk/scikit_learn_iris
https://github.com/awslabs/amazon-sagemaker-examples/tree/master/sagemaker-python-sdk/scikit_learn_randomforest
I can run them successfully and everything works. However, if I deliberately introduce a bug into the training script (the .py file), for example by adding
import boto3
which fails because boto3 is not installed on the training docker image, then for the iris example I get the error
UnexpectedStatusException: Error for Training job sagemaker-scikit-learn-2019-09-30-05-13-53-184: Failed. Reason: AlgorithmError: ExecuteUserScriptError:
Command "/usr/bin/python3 -m train --max_leaf_nodes 30"
and a similar error for the random_forest example. That is all fine.
What I do not understand is that when I remove the offending line of code from the script, returning to precisely the code which had already run successfully, I get the same error.
I have tried stopping and restarting the notebook instance, but the error remains.
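For reference, the training call in those notebooks looks roughly like the sketch below. This is only a sketch based on the linked examples: the entry-point name train.py is inferred from the failing command above, and role and train_input are assumed to be set up as in the notebooks (SDK v1-style parameter names).

from sagemaker.sklearn.estimator import SKLearn

# Sketch of the estimator setup from the linked examples.
sklearn = SKLearn(entry_point='train.py',          # inferred from "python3 -m train"
                  train_instance_type='ml.c4.xlarge',
                  role=role,                        # assumed: notebook execution role
                  hyperparameters={'max_leaf_nodes': 30})

sklearn.fit({'train': train_input})                 # assumed: S3 input channel

As far as I understand, each call to fit() starts a fresh training job and re-uploads the current version of the script, so a stale copy should not normally persist between runs.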

Related

How to fix "from caffe.proto import caffe_pb2" error in Google Colab

My teammates and I get the following error when we try to run the code for our class project. We had used this Colab notebook to install caffe: https://colab.research.google.com/github/Huxwell/caffe-colab/blob/main/caffe_details.ipynb#scrollTo=vCy0jVs6Bo7G, but we still run into the error. We tried every option we could find, but still no luck. Does anyone have an idea of how we can resolve this error?
from caffe.proto import caffe_pb2
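One thing worth checking before the import is whether the pycaffe directory is on sys.path. A minimal sketch, assuming caffe was built somewhere like /content/caffe in the Colab VM (the path is hypothetical; substitute wherever the notebook actually built it):

import sys

# Hypothetical build location; point this at the caffe/python directory
# produced by the build notebook.
sys.path.insert(0, '/content/caffe/python')

from caffe.proto import caffe_pb2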

Constant download error trying to download MNIST

The error keeps saying:
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz 16384/11490434 [..............................] - ETA: 10s
and keeps doing this continuously. The code I wrote is this:
import tensorflow as tf
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
print(x_train[0])
I am trying to print the array of an image using this command. I know I can do it other ways, but I am trying to use "keras.datasets.mnist.load_data()" specifically. What does this error mean?
My TensorFlow version is 2.6.1 and Python is 3.9.7.
(The full error is shown in a screenshot, not reproduced here.)
OK, I figured out what went wrong: the file was not being saved permanently when the code ran under Python's IDLE. When I used Jupyter Notebook instead, the download somehow persisted. In other words, after IDLE finished downloading the file, the file would disappear, and Python would keep looking for it in order to finish executing the command. Since it never could, you could say Python was chasing its tail.
If you download the file manually and use the same command, IDLE will be able to use it. However, if you change how the manually downloaded file is opened, or you use a different call such as mnist.load_data("mnist"), you will have to locate that file yourself, as I did by installing Anaconda and using Jupyter Notebook or something similar.
I have no idea why Python's IDLE behaves this way, but if you can find a way to make the download persist, the program will work.
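For reference, a minimal sketch of the manual-download workaround described above. It assumes load_data() looks for mnist.npz in the default Keras cache directory, ~/.keras/datasets; the URL is the one from the progress message in the question:

import os
import urllib.request

# Download mnist.npz once into the Keras cache so load_data() finds it
# there instead of re-downloading on every run.
url = "https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz"
dest_dir = os.path.join(os.path.expanduser("~"), ".keras", "datasets")
os.makedirs(dest_dir, exist_ok=True)
urllib.request.urlretrieve(url, os.path.join(dest_dir, "mnist.npz"))

import tensorflow as tf
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()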

Joblib doesn't work in JupyterLab

First, I use the joblib.dump command in my neural-network training notebook to dump the pipeline fitted on my input data:
joblib.dump(prepareinput, "prepareinput.save")
Then, when I try to load the data in a different notebook in JupyterLab using:
prepareinput = joblib.load('prepareinput.save')
it returns the following error:
No such file or directory: 'prepareinput.save'.
Yet the file is present in the directory. When I export the notebook as a script, it runs perfectly. I have also tried using the full path "~/..linktofile../prepareinput.save".
Does anyone have an idea on how to fix this issue?
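A quick diagnostic sketch, assuming the problem is the second notebook's working directory rather than the file itself (the print statements are just for inspection):

import os
import joblib

# See where this notebook actually resolves relative paths.
print(os.getcwd())

# Resolve to an absolute path and confirm the file is visible from here.
path = os.path.abspath("prepareinput.save")
print(path, os.path.exists(path))

prepareinput = joblib.load(path)

If os.getcwd() prints a different directory in the second notebook, loading by an absolute path (or changing the working directory) should resolve the error.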

Keras CNTK stops working when I train the model

I have changed the Keras backend to "cntk" in the keras.json file.
When I execute my Python file, it stops working.
(screenshot: "stopped working" dialog)
But when I use the TensorFlow or Theano backend, it works normally.
Why?
Update:
(screenshot: PyCharm console output)
When PyCharm reaches the point shown in the screenshot, it stops suddenly.
I then tried using cmd to execute my Python file; the result was the same "the system stopped working" message.
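For what it's worth, one way to rule out a malformed keras.json is to select the backend with the KERAS_BACKEND environment variable instead. A minimal sketch; it must run before Keras is first imported in the process:

import os

# Overrides keras.json; must be set before the first "import keras".
os.environ["KERAS_BACKEND"] = "cntk"

import keras  # should print "Using CNTK backend." on import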

Cannot call Keras from Slurm

I want to use Keras on a cluster using Slurm as the job engine.
If I open a terminal and run the following commands, everything works fine:
$python
>>> import tensorflow
>>> import keras
However, if I place import tensorflow and import keras in a Python file that I then call from Slurm:
srun [bunch of parameters for my cluster] python mypythonfile.py
Then I get the following error: ImportError: No module named keras.
Is there something specific to do when using Keras in a cluster with Slurm?
I'm reiterating my comment just to show that this question has been answered:
It's common to module load xxxx where xxxx is a different Python installation than the default. You usually stick this in your .bash_profile or a similar file to make sure that you have the Python version you want, always available.
When you submit a job with Slurm, it doesn't call your .bash_profile. It just executes the script. You need to make sure that loading your Python distribution is part of that script.
I also met this problem and solved it recently. As the answer above mentioned, the slurm command won't execute .bash_profile, and the reason you can import keras directly in an interactive python session is the environment-setup line in .bash_profile. As a result, I added export PATH="/n/home01/username/miniconda3/bin:$PATH" to my sbatch file, and then everything worked.
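A small diagnostic sketch that can be submitted through srun to confirm which interpreter the job actually gets (since, as noted above, the batch job does not source .bash_profile):

import sys

# Compare this with the interactive terminal; a different executable path
# means the Slurm job is running a different Python installation.
print(sys.executable)
print(sys.path)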
