SageMaker smdebug generating AttributeError with Pytorch container - pytorch

I am training a PyTorch model. I am able to run the training script successfully on GPU instances (for instance on EC2 instances with pytorch_p36 conda evnrionment activated). Here is the script for reference:
https://github.com/zhanghang1989/PyTorch-Encoding/blob/master/experiments/segmentation/train.py
I adapted the script to run under SageMaker but there i get this error generated by something smdebug is doing:
File "ss_training_entrypoint.py", line 400, in <module>
trainer.training(epoch)
File "ss_training_entrypoint.py", line 315, in training
loss = self.criterion(outputs, target)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 543, in __call__
result = self.forward(*input, **kwargs)
File "/opt/conda/lib/python3.6/site-packages/encoding/parallel.py", line 132, in forward
outputs = _criterion_parallel_apply(replicas, inputs, targets, kwargs)
File "/opt/conda/lib/python3.6/site-packages/encoding/parallel.py", line 185, in _criterion_parallel_apply
raise output
File "/opt/conda/lib/python3.6/site-packages/encoding/parallel.py", line 160, in _worker
output = module(*(input + target), **kwargs)
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 545, in __call__
hook_result = hook(self, input, result)
File "/opt/conda/lib/python3.6/site-packages/smdebug/pytorch/hook.py", line 156, in forward_hook
module_name = module._module_name
File "/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py", line 587, in __getattr__
type(self).__name__, name))
AttributeError: 'SegmentationLosses' object has no attribute '_module_name'
Does anyone know why this is happening and how to fix it?
Alternatively, is it possible to disable the smdebug hooks while not losing the SageMaker functionalities (i.e. Having the model trained and usable)?
Thanks very much!

Related

IndexError: too many indices for tensor of dimension 3

I am trying to convert Detectron2 Model into an onnx model format (pytorch to onnx). This is the code I am using (note I am using all necessary imports and to my understanding my installations are correct). Here is my code:
cfg = get_cfg()
cfg.MODEL.DEVICE = 'cpu'
cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"))
model = build_model(cfg)
aug = T.ResizeShortestEdge([cfg.INPUT.MIN_SIZE_TEST, cfg.INPUT.MIN_SIZE_TEST],
cfg.INPUT.MAX_SIZE_TEST)
height, width = im.shape[:2]
image = aug.get_transform(im).apply_image(im)
image = torch.as_tensor(image.astype("float32").transpose(2, 0, 1))
inputs2 = {"image": image}
print(image.shape)
output = export_onnx_model(model, image)
print("output:", output)
onnx.save(output, "/home/ecoation/Documents/model/deploy.onnx")
And here is the bug I am getting:
Traceback (most recent call last):
File "/home/ecoation/Documents/detectronCode.py", line 57, in <module>
output = export_onnx_model(model, image)
File "/home/ecoation/anaconda3/envs/detectron2/lib/python3.8/site-packages/detectron2/export/caffe2_export.py", line 49, in export_onnx_model
torch.onnx.export(
File "/home/ecoation/anaconda3/envs/detectron2/lib/python3.8/site-packages/torch/onnx/__init__.py", line 272, in export
return utils.export(model, args, f, export_params, verbose, training,
File "/home/ecoation/anaconda3/envs/detectron2/lib/python3.8/site-packages/torch/onnx/utils.py", line 88, in export
_export(model, args, f, export_params, verbose, training, input_names, output_names,
File "/home/ecoation/anaconda3/envs/detectron2/lib/python3.8/site-packages/torch/onnx/utils.py", line 694, in _export
_model_to_graph(model, args, verbose, input_names,
File "/home/ecoation/anaconda3/envs/detectron2/lib/python3.8/site-packages/torch/onnx/utils.py", line 457, in _model_to_graph
graph, params, torch_out, module = _create_jit_graph(model, args,
File "/home/ecoation/anaconda3/envs/detectron2/lib/python3.8/site-packages/torch/onnx/utils.py", line 420, in _create_jit_graph
graph, torch_out = _trace_and_get_graph_from_model(model, args)
File "/home/ecoation/anaconda3/envs/detectron2/lib/python3.8/site-packages/torch/jit/_trace.py", line 116, in wrapper
outs.append(self.inner(*trace_inputs))
File "/home/ecoation/anaconda3/envs/detectron2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 887, in _call_impl
result = self._slow_forward(*input, **kwargs)
File "/home/ecoation/anaconda3/envs/detectron2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 860, in _slow_forward
result = self.forward(*input, **kwargs)
File "/home/ecoation/anaconda3/envs/detectron2/lib/python3.8/site-packages/detectron2/modeling/meta_arch/rcnn.py", line 150, in forward
return self.inference(batched_inputs)
File "/home/ecoation/anaconda3/envs/detectron2/lib/python3.8/site-packages/detectron2/modeling/meta_arch/rcnn.py", line 203, in inference
images = self.preprocess_image(batched_inputs)
File "/home/ecoation/anaconda3/envs/detectron2/lib/python3.8/site-packages/detectron2/modeling/meta_arch/rcnn.py", line 228, in preprocess_image
images = [self._move_to_current_device(x["image"]) for x in batched_inputs]
File "/home/ecoation/anaconda3/envs/detectron2/lib/python3.8/site-packages/detectron2/modeling/meta_arch/rcnn.py", line 228, in <listcomp>
images = [self._move_to_current_device(x["image"]) for x in batched_inputs]
IndexError: too many indices for tensor of dimension 2
Feel free to take a look, any help in turning this into a onnx format model is much appreciated. If it helps, I am on Ubuntu 18.04 and all I'm trying to do is take a pretrained simple detectron2 model from regular to onnx format. I'm hoping to get it so I can do custom, but any help with pretrained would be much appreciated.
Edit:
Image size is this :
torch.Size([3, 800, 1067])

RuntimeError: can't start new thread

My objective was to use Scikit-Optimize library in python to minimize the function value in order to find the optimized parameters for xgboost model. The process involve running the model with different random parameters for 5,000 times.
However, it seems that the loop stopped at some point and gave me a RuntimeError: can't start new thread. I am using ubuntu 20.04 and is running python 3.8.5, Scikit-Optimize version is 0.8.1. I ran the same code in windows 10 and it appears that I do not encounter this RuntimeError, however, the code runs much more slower.
I think I may need a threadpool to solve this issue but after searching through the web and I had no luck on finding a solution to implement the threadpool.
Below is a simplified version of the codes:
#This function will be passed to Scikit-Optimize to find the optimized parameters (Params)
def find_best_xgboost_para(params):`
#Defines the parameters that I want to optimize
learning_rate,gamma,max_depth,min_child_weight,reg_alpha,reg_lambda,subsample,max_bin,num_parallel_tree,colsamp_lev,colsamp_tree,StopSteps\
=float(params[0]),float(params[1]),int(params[2]),int(params[3]),\
int(params[4]),int(params[5]),float(params[6]),int(params[7]),int(params[8]),float(params[9]),float(params[10]),int(params[11])
xgbc=XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=colsamp_lev,
colsample_bytree=colsamp_tree, gamma=gamma, learning_rate=learning_rate, max_delta_step=0,
max_depth=max_depth, min_child_weight=min_child_weight, missing=None, n_estimators=nTrees,
objective='binary:logistic',random_state=101, reg_alpha=reg_alpha,
reg_lambda=reg_lambda, scale_pos_weight=1,seed=101,
subsample=subsample,importance_type='gain',gpu_id=GPUID,max_bin=max_bin,
tree_method='gpu_hist',num_parallel_tree=num_parallel_tree,predictor='gpu_predictor',verbosity=0,\
refresh_leaf=0,grow_policy='depthwise',process_type=TreeUpdateStatus,single_precision_histogram=SinglePrecision)
tscv = TimeSeriesSplit(CV_nSplit)
error_data=xgboost.cv(xgbc.get_xgb_params(), CVTrain, num_boost_round=CVBoostRound, nfold=None, stratified=False, folds=tscv, metrics=(), \
obj=None, feval=f1_eval, maximize=False, early_stopping_rounds=StopSteps, fpreproc=None, as_pandas=True, \
verbose_eval=True, show_stdv=True, seed=101, shuffle=shuffle_trig)
eval_set = [(X_train, y_train), (X_test, y_test)]
xgbc.fit(X_train, y_train, eval_metric=f1_eval, early_stopping_rounds=StopSteps, eval_set=eval_set,verbose=True)
xgbc_predictions=xgbc.predict(X_test)
error=(1-metrics.f1_score(y_test, xgbc_predictions,average='macro'))
del xgbc
return error
#Define the range of values that Scikit-Optimize can choose from to find the optimized parameters
lr_low, lr_high=float(XgParamDict['lr_low']), float(XgParamDict['lr_high'])
gama_low, gama_high=float(XgParamDict['gama_low']), float(XgParamDict['gama_high'])
depth_low, depth_high=int(XgParamDict['depth_low']), int(XgParamDict['depth_high'])
child_weight_low, child_weight_high=int(XgParamDict['child_weight_low']), int(XgParamDict['child_weight_high'])
alpha_low,alpha_high=int(XgParamDict['alpha_low']),int(XgParamDict['alpha_high'])
lambda_low,lambda_high=int(XgParamDict['lambda_low']),int(XgParamDict['lambda_high'])
subsamp_low,subsamp_high=float(XgParamDict['subsamp_low']),float(XgParamDict['subsamp_high'])
max_bin_low,max_bin_high=int(XgParamDict['max_bin_low']),int(XgParamDict['max_bin_high'])
num_parallel_tree_low,num_parallel_tree_high=int(XgParamDict['num_parallel_tree_low']),int(XgParamDict['num_parallel_tree_high'])
colsamp_lev_low,colsamp_lev_high=float(XgParamDict['colsamp_lev_low']),float(XgParamDict['colsamp_lev_high'])
colsamp_tree_low,colsamp_tree_high=float(XgParamDict['colsamp_tree_low']),float(XgParamDict['colsamp_tree_high'])
StopSteps_low,StopSteps_high=float(XgParamDict['StopSteps_low']),float(XgParamDict['StopSteps_high'])
#Pass the target function (find_best_xgboost_para) as well as parameter ranges to Scikit-Optimize, 'res' will be an array of values that will need to be pass to another function
res=gbrt_minimize(find_best_xgboost_para,[(lr_low,lr_high),(gama_low, gama_high),(depth_low,depth_high),(child_weight_low,child_weight_high),\
(alpha_low,alpha_high),(lambda_low,lambda_high),(subsamp_low,subsamp_high),(max_bin_low,max_bin_high),\
(num_parallel_tree_low,num_parallel_tree_high),(colsamp_lev_low,colsamp_lev_high),(colsamp_tree_low,colsamp_tree_high),\
(StopSteps_low,StopSteps_high)],random_state=101,n_calls=5000,n_random_starts=1500,verbose=True,n_jobs=-1)
Below is the error message:
Traceback (most recent call last):
File "/home/FactorOpt.py", line 91, in <module>Opt(**FactorOptDict)
File "/home/anaconda3/lib/python3.8/site-packages/skopt/optimizer/gbrt.py", line 179, in gbrt_minimize return base_minimize(func, dimensions, base_estimator,
File "/home/anaconda3/lib/python3.8/site-packages/skopt/optimizer/base.py", line 301, in base_minimize
next_y = func(next_x)
File "/home/anaconda3/lib/python3.8/modelling/FactorOpt.py", line 456, in xgboost_opt
res=gbrt_minimize(find_best_xgboost_para,[(lr_low,lr_high),(gama_low, gama_high),(depth_low,depth_high),(child_weight_low,child_weight_high),\
File "/home/anaconda3/lib/python3.8/site-packages/skopt/optimizer/gbrt.py", line 179, in gbrt_minimize
return base_minimize(func, dimensions, base_estimator,
File "/home/anaconda3/lib/python3.8/site-packages/skopt/optimizer/base.py", line 302, in base_minimize
result = optimizer.tell(next_x, next_y)
File "/home/anaconda3/lib/python3.8/site-packages/skopt/optimizer/optimizer.py", line 493, in tell
return self._tell(x, y, fit=fit)
File "/home/anaconda3/lib/python3.8/site-packages/skopt/optimizer/optimizer.py", line 536, in _tell
est.fit(self.space.transform(self.Xi), self.yi)
File "/home/anaconda3/lib/python3.8/site-packages/skopt/learning/gbrt.py", line 85, in fit
self.regressors_ = Parallel(n_jobs=self.n_jobs, backend='threading')(
File "/home/anaconda3/lib/python3.8/site-packages/joblib/parallel.py", line 1048, in __call__
if self.dispatch_one_batch(iterator):
File "/home/anaconda3/lib/python3.8/site-packages/joblib/parallel.py", line 866, in dispatch_one_batch
self._dispatch(tasks)
File "/home/anaconda3/lib/python3.8/site-packages/joblib/parallel.py", line 784, in _dispatch
job = self._backend.apply_async(batch, callback=cb)
File "/home/anaconda3/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 252, in apply_async
return self._get_pool().apply_async(
File "/home/anaconda3/lib/python3.8/site-packages/joblib/_parallel_backends.py", line 407, in _get_pool
self._pool = ThreadPool(self._n_jobs)
File "/home/anaconda3/lib/python3.8/multiprocessing/pool.py", line 925, in __init__
Pool.__init__(self, processes, initializer, initargs)
File "/home/anaconda3/lib/python3.8/multiprocessing/pool.py", line 232, in __init__
self._worker_handler.start()
File "/home/anaconda3/lib/python3.8/threading.py", line 852, in start
_start_new_thread(self._bootstrap, ())
RuntimeError: can't start new thread

TensorFlow 2.1 using TPUEstimator: RuntimeError: All tensors outfed from TPU should preserve batch size dimension, but got scalar Tensor

I just converted an existing project from TF 1.14 to TF 2.1 which uses the TPUEstimator API. After making the conversion, testing locally (i.e. use_tpu=False) runs successfully. However, I am getting errors when running on Google Cloud TPU (i.e. use_tpu=True).
Note: This is in the context of the AdaNet AutoML framework (v0.8.0), although I suspect this may be a general TPUEstimator-related error, as the errors appear to originate in the tpu_estimator.py and error_handling.py scripts seen in the Traceback below:
File "/usr/local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3032, in train
rendezvous.record_error('training_loop', sys.exc_info())
File "/usr/local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/error_handling.py", line 81, in record_error
if value and value.op and value.op.type == _CHECK_NUMERIC_OP_NAME:
AttributeError: 'RuntimeError' object has no attribute 'op'
During handling of the above exception, another exception occurred:
File "workspace/trainer/train.py", line 331, in <module>
main(args=parsed_args)
File "workspace/trainer/train.py", line 177, in main
run_config=run_config)
File "workspace/trainer/train.py", line 68, in run_experiment
estimator.train(input_fn=train_input_fn, max_steps=total_train_steps)
File "/usr/local/lib/python3.6/site-packages/adanet/core/estimator.py", line 853, in train
saving_listeners=saving_listeners)
File "/usr/local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3035, in train
rendezvous.raise_errors()
File "/usr/local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/error_handling.py", line 143, in raise_errors
six.reraise(typ, value, traceback)
File "/usr/local/lib/python3.6/site-packages/six.py", line 703, in reraise
raise value
File "/usr/local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3030, in train
saving_listeners=saving_listeners)
File "/usr/local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 374, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1164, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1194, in _train_model_default
features, labels, ModeKeys.TRAIN, self.config)
File "/usr/local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2857, in _call_model_fn
config)
File "/usr/local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1152, in _call_model_fn
model_fn_results = self._model_fn(features=features, **kwargs)
File "/usr/local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 3186, in _model_fn
host_ops = host_call.create_tpu_hostcall()
File "/usr/local/lib/python3.6/site-packages/tensorflow_estimator/python/estimator/tpu/tpu_estimator.py", line 2226, in create_tpu_hostcall
'dimension, but got scalar {}'.format(dequeue_ops[i][0]))
RuntimeError: All tensors outfed from TPU should preserve batch size dimension, but got scalar Tensor("OutfeedDequeueTuple:1", shape=(), dtype=int64, device=/job:tpu_worker/task:0/device:CPU:0)'
The previous version of the project using TF 1.14 runs both locally and on TPU using TPUEstimator without issues. Is there something obvious I am potentially missing for the conversion over to TF 2.1 when using TPUEstimator API?
Have you applied the following:
dataset = ...
dataset = dataset.apply(tf.contrib.data.batch_and_drop_remainder(batch_size))
this potentially drops the last few samples from a file to ensure that every batch has a static shape of batch_size, which is required when training on TPUs.

Pytorch model weight type conversion

I'm trying to do inference on FlowNet2-C model loading from file.
However, I met some data type problem. How can I resolve it?
Source code
FlowNet2-C pre-trained model
$ python main.py
Initializing Datasets
[0.000s] Loading checkpoint '/notebooks/data/model/FlowNet2-C_checkpoint.pth.tar'
[1.293s] Loaded checkpoint '/notebooks/data/model/FlowNet2-C_checkpoint.pth.tar' (at epoch 0)
(1L, 6L, 384L, 512L)
<class 'torch.autograd.variable.Variable'>
[1.642s] Operation failed
Traceback (most recent call last):
File "main.py", line 102, in <module>
main()
File "main.py", line 98, in main
summary(input_size, model)
File "main.py", line 61, in summary
model(x)
File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 357, in __call__
result = self.forward(*input, **kwargs)
File "/notebooks/data/vinet/FlowNetC.py", line 75, in forward
out_conv1a = self.conv1(x1)
File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 357, in __call__
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/container.py", line 67, in forward
input = module(input)
File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 357, in __call__
result = self.forward(*input, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/conv.py", line 282, in forward
self.padding, self.dilation, self.groups)
File "/usr/local/lib/python2.7/dist-packages/torch/nn/functional.py", line 90, in conv2d
return f(input, weight, bias)
RuntimeError: Input type (CUDAFloatTensor) and weight type (CPUFloatTensor) should be the same
Maybe that is because your model and input x to the model has different data types. It seems that the model parameters have been moved to GPU, but your input x is on GPU.
You can try to use model.cuda() after line 94, which will put the model on the GPU. Then the error should disappear.

AttributeError : lower not found

I am doing Document Classification and obtained accuracy upto 76%. And while predicting the document category i did following one
doc_clf.predict(tf_idf.transform((count_vect.transform([r'document']))))
and i get the following error:
File "/usr/local/lib/python3.5/dist- packages/sklearn/utils/metaestimators.py", line 115, in <lambda>
out = lambda *args, **kwargs: self.fn(obj, *args, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/sklearn/pipeline.py", line 306, in predict
Xt = transform.transform(Xt)
File "/usr/local/lib/python3.5/dist-packages/sklearn/feature_extraction/text.py", line 923, in transform
_, X = self._count_vocab(raw_documents, fixed_vocab=True)
File "/usr/local/lib/python3.5/dist-packages/sklearn/feature_extraction/text.py", line 792, in _count_vocab
for feature in analyze(doc):
File "/usr/local/lib/python3.5/dist-packages/sklearn/feature_extraction/text.py", line 266, in <lambda>
tokenize(preprocess(self.decode(doc))), stop_words)
File "/usr/local/lib/python3.5/dist-packages/sklearn/feature_extraction/text.py", line 232, in <lambda>
return lambda x: strip_accents(x.lower())
File "/usr/local/lib/python3.5/dist-packages/scipy/sparse/base.py", line 647, in __getattr__
raise AttributeError(attr + " not found")
How do i correct this error ? And any other way to improve the accuracy further?
I share link to review full code
Full Code
In your code, doc_clf is a pipeline. So the tf_idf.transform() and count_vect.transform() will be handled automatically by the pipeline.
You should only call
category = doc_clf.predict([r'document'])
As this document passes through the pipeline, it will be automatically transformed by the CountVectorizer and TfidfTransformer.

Resources