CUDA Illegal Memory Access on PyTorch 1.3 - pytorch

@staticmethod
def backward(ctx, grad_output):
    grad_label = grad_output.clone()
    num_ft = grad_output.shape[0]
    # grad_label.data.resize_(num_ft, 32, 41)
    lin_indices_3d, lin_indices_2d = ctx.saved_variables
    num_ind = lin_indices_3d.data[0]
    grad_label.data.view(num_ft, -1).index_copy_(1, lin_indices_2d.data[1:1 + num_ind],
                                                 torch.index_select(grad_output.data.contiguous().view(num_ft, -1),
                                                                    1, lin_indices_3d.data[1:1 + num_ind]))
    # raw_input('sdflkj')
    return grad_label, None, None, None
This is the code snippet I am trying to run on PyTorch. However, I keep getting an illegal memory access error. When I tried to use a debugger to find the culprit, I would see
As such, I am not certain what is wrong here. The same code ran on PyTorch 0.4, but it does not work now that I am trying to run it on PyTorch 1.3. The same error remains on versions 1.4 and 1.5, which are the latest versions of the framework. Any help would be highly appreciated.

It turns out this was a bug in the PyTorch framework itself. They are going to fix it in version 1.6.
Here is the GitHub link:
https://github.com/pytorch/pytorch/issues/34450
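Independently of the upstream fix, it can help to rule out bad indices on your own side when chasing this kind of error. The following is a minimal sketch, not part of the original answer. `CUDA_LAUNCH_BLOCKING` is the standard CUDA environment variable that makes kernel launches synchronous, so the Python traceback points at the call that actually failed. `check_index_bounds` is a hypothetical helper that validates indices on the CPU before they reach `index_copy_` or `index_select`.

```python
import os

# Set before CUDA is initialised (simplest: before importing torch) so that
# kernel launches run synchronously and errors surface at the offending call.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


def check_index_bounds(indices, size):
    """Raise IndexError if any index lies outside [0, size).

    Hypothetical helper: `indices` is any iterable of ints, for example
    the result of calling .tolist() on a CPU copy of an index tensor.
    """
    bad = [int(i) for i in indices if not 0 <= int(i) < size]
    if bad:
        raise IndexError(
            "{} indices out of range for size {}: {}".format(len(bad), size, bad[:5])
        )
```

In the `backward` above, the helper would be applied to the index slices passed to `index_copy_` and `index_select`, with `size` set to the length of the flattened second dimension.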

Related

module 'torch.cuda' has no attribute 'memory_summary'

I'm trying to measure the available space on each of my GPUs using the torch.cuda module. However, it returns the following error:
module 'torch.cuda' has no attribute 'memory_summary'
My code is below
if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        print(torch.cuda.get_device_name(i))
        a = torch.cuda.memory_summary(torch.device('cuda:{}'.format(i)))
        print(a)
Similarly, memory_stats, mem_get_info and memory_reserved are all failing.
torch.cuda.memory_summary was introduced in PyTorch 1.4.0, so if your torch install is older than that, you won't be able to use it.
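To confirm this on a given machine, you can print the installed version and check which of the memory helpers the build actually exposes. This is a small sketch: `available_memory_apis` and `MEMORY_APIS` are hypothetical names, and the check relies only on `hasattr`, so it makes no assumption about which release added which function.

```python
try:
    import torch
except ImportError:  # lets the helper below be read and tested without torch
    torch = None

MEMORY_APIS = ("memory_summary", "memory_stats", "memory_reserved", "mem_get_info")


def available_memory_apis(cuda_module, names=MEMORY_APIS):
    """Map each helper name to True if this build of the module exposes it."""
    return {name: hasattr(cuda_module, name) for name in names}


if torch is not None:
    print("torch", torch.__version__)
    for name, present in available_memory_apis(torch.cuda).items():
        print(name, "available" if present else "missing")
```

Any helper reported as missing requires upgrading PyTorch before it can be called.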

tensorboard images are not getting updated

I have the following code for my reinforcement learning program:
tbCallBack = callbacks.TensorBoard(log_dir = log_dir,
                                   histogram_freq = 1, update_freq = 'epoch', write_graph = True, write_images = True)
The model fits data soon after getting data for training:
model.fit(x = np.vstack(x_train),
          y = np.vstack(y_train),
          callbacks = [tbCallBack],
          verbose = 1, sample_weight = s_t)
I see my SCALARS, DISTRIBUTIONS, and HISTOGRAMS all getting updated in TensorBoard, but not the IMAGES.
I see only one image, and it never updates. Can you please let me know where the problem is?
Here is the version information:
tensorboard               2.3.0    pyh4dce500_0
tensorboard-plugin-wit    1.6.0    py_0
keras                     2.4.3    0
keras-base                2.4.3    py_0
I am not sure if my answer will help, but in my experience
TensorBoard does not update images reliably on Windows, regardless of the web browser being used.
The issue seems to exist even in the latest versions (2.6.0) of both TensorBoard and TensorFlow.
Please check which OS you are running your reinforcement learning model on. If you are using Windows, I suggest porting the Python code to a Linux-based OS such as Ubuntu.

RuntimeError: "exp" not implemented for 'torch.LongTensor'

I am following this tutorial: http://nlp.seas.harvard.edu/2018/04/03/attention.html
to implement the Transformer model from the "Attention Is All You Need" paper.
However I am getting the following error :
RuntimeError: "exp" not implemented for 'torch.LongTensor'
This is the line, in the PositionalEncoding class, that is causing the error:
div_term = torch.exp(torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model))
When it is being constructed here:
pe = PositionalEncoding(20, 0)
Any ideas? I've already tried converting this to a float tensor type, but that has not worked.
I've even downloaded the whole notebook with accompanying files and the error seems to persist in the original tutorial.
Any ideas what may be causing this error?
Thanks!
I happened to follow this tutorial too.
For me, the fix was to make torch.arange generate a float tensor, changing the code
from
position = torch.arange(0, max_len).unsqueeze(1)
div_term = torch.exp(torch.arange(0, d_model, 2) * -(math.log(10000.0) / d_model))
to
position = torch.arange(0., max_len).unsqueeze(1)
div_term = torch.exp(torch.arange(0., d_model, 2) * -(math.log(10000.0) / d_model))
Just a simple fix, but now it works for me. It is possible that torch.exp and torch.sin previously supported LongTensor but no longer do (I am not sure about this).
It seems that torch.arange returns a LongTensor; try torch.arange(0.0, d_model, 2) to force torch to return a FloatTensor instead.
The suggestion given by @shai worked for me. I modified the __init__ method of PositionalEncoding by using 0.0 in two spots:
position = torch.arange(0.0, max_len).unsqueeze(1)
div_term = torch.exp(torch.arange(0.0, d_model, 2) * -(math.log(10000.0) / d_model))
For me, installing pytorch==1.7.1 solved the problem.
As Rubens said, in newer versions of PyTorch you don't need to worry about this. I can run it unchanged on my desktop with PyTorch 1.8.0, but it fails on my server with PyTorch 1.2.0. The behavior differs between versions.
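If you want the same code to behave identically on old and new installs, one option is to request the dtype explicitly instead of relying on float literals. This is a minimal sketch and not the tutorial's own code: `make_div_term` is a hypothetical helper name, and it uses the `dtype` keyword of `torch.arange` so the result is a float tensor regardless of how integer arguments are interpreted.

```python
import math

try:
    import torch
except ImportError:  # the sketch needs PyTorch to actually run
    torch = None


def make_div_term(d_model):
    """Frequencies for sinusoidal positional encoding, as a float tensor."""
    # dtype=torch.float32 avoids the integer (Long) tensor that
    # torch.arange(0, d_model, 2) would otherwise produce.
    steps = torch.arange(0, d_model, 2, dtype=torch.float32)
    return torch.exp(steps * -(math.log(10000.0) / d_model))


if torch is not None:
    print(make_div_term(8).dtype)
```

The `position` tensor in the tutorial can be handled the same way, by passing `dtype=torch.float32` to its `torch.arange` call.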

Tensorflow serving: request fails with object has no attribute 'unary_unary

I'm building a CNN text classifier using TensorFlow which I want to load in tensorflow-serving and query using the serving APIs. When I call the Predict() method on the gRPC stub I receive this error: AttributeError: 'grpc._cython.cygrpc.Channel' object has no attribute 'unary_unary'
What I've done to date:
I have successfully trained and exported a model suitable for serving (i.e., the signatures are verified and using tf.Saver I can successfully return a prediction). I can also load the model in tensorflow_model_server without error.
Here is a snippet of the client code (simplified for readability):
with tf.Session() as sess:
    host = FLAGS.server
    channel = grpc.insecure_channel('localhost:9001')
    stub = prediction_service_pb2.beta_create_PredictionService_stub(channel)
    request = predict_pb2.PredictRequest()
    request.model_spec.name = 'predict_text'
    request.model_spec.signature_name = 'predict_text'
    x_text = ["space"]
    # restore vocab processor
    # then create a ndarray with transform_fit using the vocabulary
    vocab = learn.preprocessing.VocabularyProcessor.restore('/some_path/model_export/1/assets/vocab')
    x = np.array(list(vocab.fit_transform(x_text)))
    # data
    temp_data = tf.contrib.util.make_tensor_proto(x, shape=[1, 15], verify_shape=True)
    request.inputs['input'].CopyFrom(tf.contrib.util.make_tensor_proto(x, shape=[1, 15], verify_shape=True))
    # get classification prediction
    result = stub.Predict(request, 5.0)
Where I'm bending the rules: I am using tensorflow-serving-apis in Python 3.5.3, where pip install is not officially supported. Various posts (example: https://github.com/tensorflow/serving/issues/581) have reported that using tensorflow-serving with Python 3 has been successful. I have downloaded the tensorflow-serving-apis package from PyPI (https://pypi.python.org/pypi/tensorflow-serving-api/1.5.0) and manually pasted it into the environment.
Versions: tensorflow: 1.5.0, tensorflow-serving-apis: 1.5.0, grpcio: 1.9.0rc3, grpcio-tools: 1.9.0rcs, protobuf: 3.5.1 (all other dependency versions have been verified but are not included for brevity; happy to add them if they are useful)
Environment: Linux Mint 17 Qiana; x64, Python 3.5.3
Investigations:
A GitHub issue (https://github.com/GoogleCloudPlatform/google-cloud-python/issues/2258) indicated that, historically, a package that triggered this error was related to the gRPC beta API.
What data or learning or implementation am I missing?
beta_create_PredictionService_stub() is deprecated. Try this:
from tensorflow_serving.apis import prediction_service_pb2_grpc
...
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)
Try to use grpc.beta.implementations.insecure_channel instead of grpc.insecure_channel.
See example code here.

TensorFlow 0.12 tutorials produce warning: "Rank of input Tensor should be the same as output_rank for column

I have some experience with writing machine learning programs in python, but I'm new to TensorFlow and am checking it out. My dev environment is a lubuntu 14.04 64-bit virtual machine. I've created a python 3.5 conda environment from miniconda and installed TensorFlow 0.12 and its dependencies. I began trying to run some example code from TensorFlow's tutorials and encountered this warning when calling fit() in the boston.py example for input functions: source.
WARNING:tensorflow:Rank of input Tensor (1) should be the same as
output_rank (2) for column. Will attempt to expand dims. It is highly
recommended that you resize your input, as this behavior may change.
After some searching in Google, I found other people encountered this same warning:
https://github.com/tensorflow/tensorflow/issues/6184
https://github.com/tensorflow/tensorflow/issues/5098
Tensorflow - Boston Housing Data Tutorial Errors
However, they also experienced errors which prevent code execution from completing. In my case, the code executes with the above warning. Unfortunately, I couldn't find a single answer in those links regarding what caused the warning and how to fix the warning. They all focused on the error. How does one remove the warning? Or is the warning safe to ignore?
Cheers!
Extra info, I also see the following warnings when running the aforementioned boston.py example.
WARNING:tensorflow:*******************************************************
WARNING:tensorflow:TensorFlow's V1 checkpoint format has been deprecated.
WARNING:tensorflow:Consider switching to the more efficient V2 format:
WARNING:tensorflow:   'tf.train.Saver(write_version=tf.train.SaverDef.V2)'
WARNING:tensorflow:now on by default.
WARNING:tensorflow:*******************************************************
and
WARNING:tensorflow:From
/home/kade/miniconda3/envs/tensorflow/lib/python3.5/site-packages/tensorflow/contrib/learn/python/learn/estimators/dnn_linear_combined.py:1053
in predict.: calling BaseEstimator.predict (from
tensorflow.contrib.learn.python.learn.estimators.estimator) with x is
deprecated and will be removed after 2016-12-01. Instructions for
updating: Estimator is decoupled from Scikit Learn interface by moving
into separate class SKCompat. Arguments x, y and batch_size are only
available in the SKCompat class, Estimator will only accept input_fn.
Example conversion: est = Estimator(...) -> est =
SKCompat(Estimator(...))
UPDATE (2016-12-22):
I've tracked the warning to this file:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/layers/python/layers/feature_column_ops.py
and this code block:
except NotImplementedError:
    with variable_scope.variable_scope(
            None,
            default_name=column.name,
            values=columns_to_tensors.values()):
        tensor = column._to_dense_tensor(transformed_tensor)
        tensor = fc._reshape_real_valued_tensor(tensor, 2, column.name)
        variable = [
            contrib_variables.model_variable(
                name='weight',
                shape=[tensor.get_shape()[1], num_outputs],
                initializer=init_ops.zeros_initializer(),
                trainable=trainable,
                collections=weight_collections)
        ]
        predictions = math_ops.matmul(tensor, variable[0], name='matmul')
Note the line: tensor = fc._reshape_real_valued_tensor(tensor, 2, column.name)
The method signature is: _reshape_real_valued_tensor(input_tensor, output_rank, column_name=None)
The value 2 is hardcoded as the value of output_rank, but the boston.py example is passing in an input_tensor of rank 1. I will continue to investigate.
If you specify the shape of your tensor explicitly:
tf.constant(df[k].values, shape=[df[k].size, 1])
the warning should go away.
After I specified the shape of the tensor explicitly:
continuous_cols = {k: tf.constant(df[k].values, shape=[df[k].size, 1]) for k in CONTINUOUS_COLUMNS}
It works!
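The explicit `shape=[df[k].size, 1]` turns each rank-1 column into a rank-2 column vector, which matches the hardcoded `output_rank` of 2. The reshaping itself can be illustrated with NumPy alone. This is only a sketch of the idea: `to_column` is a hypothetical helper, the sample values are made up, and it is not the tutorial's code.

```python
import numpy as np


def to_column(values):
    """Reshape a 1-D sequence of n values into an (n, 1) column array."""
    arr = np.asarray(values)
    return arr.reshape(-1, 1)


prices = [24.0, 21.6, 34.7]          # made-up sample values
print(np.asarray(prices).ndim)       # rank of the raw input
print(to_column(prices).shape)       # shape after adding the second axis
```

Passing an array that already has the second axis gives the column the rank it expects, so no implicit dimension expansion is needed.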

Resources