While running the Hugging Face gpt2-xl model, embedding index getting out of range - python-3.x

I am trying to run the Hugging Face gpt2-xl model. I ran the code from the quickstart page that loads the small gpt2 model and generates text with the following code:
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained('gpt2')
generated = tokenizer.encode("The Manhattan bridge")
context = torch.tensor([generated])
past = None
for i in range(100):
    print(i)
    output, past = model(context, past=past)
    token = torch.argmax(output[0, :])
    generated += [token.tolist()]
    context = token.unsqueeze(0)
sequence = tokenizer.decode(generated)
print(sequence)
This runs perfectly. Then I tried to run the gpt2-xl model. I changed the tokenizer and model loading code as follows:
tokenizer = GPT2Tokenizer.from_pretrained("gpt2-xl")
model = GPT2LMHeadModel.from_pretrained('gpt2-xl')
The tokenizer and model loaded perfectly, but I am getting an error on the following line:
output, past = model(context, past=past)
The error is:
RuntimeError: index out of range: Tried to access index 204483 out of table with 50256 rows. at /pytorch/aten/src/TH/generic/THTensorEvenMoreMath.cpp:418
Looking at the error, it seems that the embedding size is not correct. So I wrote the following line to specifically fetch the config of gpt2-xl:
config = GPT2Config.from_pretrained("gpt2-xl")
But here vocab_size is 50257.
So I explicitly changed the value with:
config.vocab_size=204483
After printing the config, I can see that the change took effect in the configuration, but I am still getting the same error.

This was actually an issue I reported and they fixed it.
https://github.com/huggingface/transformers/issues/2774
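For anyone hitting this later, here is a minimal sketch, assuming a transformers release that already includes the fix from that issue, of running gpt2-xl with the built-in generate() helper instead of the hand-written greedy loop (generate() performs greedy decoding by default):
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-xl")
model = GPT2LMHeadModel.from_pretrained("gpt2-xl")

# encode the prompt and generate up to 100 tokens (greedy by default)
input_ids = tokenizer.encode("The Manhattan bridge", return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(input_ids, max_length=100)
print(tokenizer.decode(output_ids[0]))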

Related

Calling VGG many times causes an out of memory error

I want to extract the VGG features of a set of images and keep them in memory in a dictionary. The dictionary ends up holding 8091 tensors, each of shape (1, 4096), but my machine crashes with an out of memory error after about 6% of the way through. Does anybody have a clue why this is happening and how to prevent it?
In fact, this seems to be triggered by the call to VGG itself rather than by the storage, since just storing the VGG classification output is enough to trigger the error.
Below is the simplest code I've found to reproduce the error. Once a helper function is defined:
import torch, torchvision
from tqdm import tqdm
vgg = torchvision.models.vgg16(weights='DEFAULT')
def try_and_crash(gen_data):
    store_out = {}
    for i in tqdm(range(8091)):
        my_output = gen_data(torch.randn(1, 3, 224, 224))
        store_out[i] = my_output
    return store_out
Calling it to quickly produce a large tensor doesn't cause a fuss
just_fine = try_and_crash(lambda x: torch.randn(1,4096))
but calling it to use vgg causes the machine to crash:
will_crash = try_and_crash(vgg)
The problem is that each element of the dictionary, store_out[i], also keeps a reference to the computation graph (the intermediate activations needed to compute gradients), and therefore ends up taking far more memory than a plain 1x4096 tensor.
Running the code under torch.no_grad(), or equivalently with torch.set_grad_enabled(False), solves the issue. We can test it by slightly changing the helper function:
def try_and_crash_grad(gen_data, grad_enabled):
    store_out = {}
    for i in tqdm(range(8091)):
        with torch.set_grad_enabled(grad_enabled):
            my_output = gen_data(torch.randn(1, 3, 224, 224))
            store_out[i] = my_output
    return store_out
Now the following works
works_fine = try_and_crash_grad(vgg, False)
while the following throws an out of memory error
crashes = try_and_crash_grad(vgg, True)
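An alternative sketch, not from the original answer: keep autograd enabled but detach each output before storing it, so the graph attached to each forward pass can be freed right away.
def store_detached(gen_data):
    store_out = {}
    for i in tqdm(range(8091)):
        my_output = gen_data(torch.randn(1, 3, 224, 224))
        store_out[i] = my_output.detach()  # drop the autograd graph reference
    return store_out

also_fine = store_detached(vgg)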

Loaded PyTorch model has a different result compared to saved model

I have a python script that trains and then tests a CNN model. The model weights/parameters are saved after testing through the use of:
checkpoint = {'state_dict': model.state_dict(), 'optimizer': optimizer.state_dict()}
torch.save(checkpoint, path + filename)
After saving, I immediately reload the model. First I re-create it through a function:
model_load = create_model(cnn_type="vgg", numberofclasses=len(cases))
And then, I load the model weights/parameters through:
model_load.load_state_dict(torch.load(filePath+filename), strict = False)
model_load.eval()
Finally, I feed this model the same testing data I used before the model was saved.
The problem is that the testing results are not the same when I compare the results of the model before saving and after loading. My hunch is that, due to strict = False, some of the parameters are not being passed to the model. However, when I set strict = True, I receive errors. Is there a workaround for this?
The error message is:
RuntimeError: Error(s) in loading state_dict for CNN:
Missing key(s) in state_dict: "linear.weight", "linear.bias", "linear2.weight", "linear2.bias", "linear3.weight", "linear3.bias". Unexpected key(s) in state_dict: "state_dict", "optimizer".
You are loading a dictionary containing the state of your model as well as the optimizer's state. According to your error stack trace, the following should solve the issue:
>>> model_state = torch.load(filePath+filename)['state_dict']
>>> model_load.load_state_dict(model_state, strict=True)
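For completeness, a minimal sketch (assuming the checkpoint layout from the question) that restores both the model and, if you intend to resume training, the optimizer:
checkpoint = torch.load(filePath + filename)
model_load.load_state_dict(checkpoint['state_dict'], strict=True)
optimizer.load_state_dict(checkpoint['optimizer'])  # only needed when resuming training
model_load.eval()  # disable dropout/batch-norm updates before testing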

The inference file that goes into the entry point of PyTorchModel to be deployed does not have an effect on the output of the predictor

I am currently running the code on AWS SageMaker, trying to predict data using an already-trained model, accessed via MODEL_URL.
With the code below, the inference.py passed as entry_point does not seem to have an effect on the result of the trained prediction model. Changes in inference.py do not alter the output (the output is always correct). Is there something I am misunderstanding about how the model works? And how can I incorporate inference.py into the prediction model as the entry point?
role = sagemaker.get_execution_role()

model = PyTorchModel(model_data=MODEL_URL,
                     role=role,
                     framework_version='0.4.0',
                     entry_point='/inference.py',
                     source_dir=SOURCE_DIR)

predictor = model.deploy(instance_type='ml.c5.xlarge',
                         initial_instance_count=1,
                         endpoint_name=RT_ENDPOINT_NAME)

result = predictor.predict(someData)
The entry point (inference.py) is the code file that defines how the model is loaded, how input is preprocessed, the prediction logic, and how output is postprocessed.
"Any changes in inference.py does not alter the output"
What are you changing in inference.py that you expect to alter the result of predictor.predict? If the underlying model_data is not changing, the entry point script will be using the same model. Are you changing how the model is loaded in model_fn, or how predictions are processed via input_fn, predict_fn, or output_fn?
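For reference, a minimal sketch of what the handler functions in an inference.py for the SageMaker PyTorch container typically look like; the nn.Linear architecture, the model.pth file name, and the JSON payload format are placeholder assumptions, not details from the question. If a handler is omitted, the container generally falls back to its default implementation, which can make edits to the file appear to have no effect.
import json
import os

import torch
import torch.nn as nn

def model_fn(model_dir):
    # Rebuild the network and load the weights unpacked from model_data.
    model = nn.Linear(10, 2)  # placeholder architecture, not the questioner's model
    model.load_state_dict(torch.load(os.path.join(model_dir, 'model.pth'), map_location='cpu'))
    model.eval()
    return model

def input_fn(request_body, request_content_type):
    # Deserialize the request payload (assumed to be a JSON list of floats) into a tensor.
    return torch.tensor(json.loads(request_body), dtype=torch.float32)

def predict_fn(input_data, model):
    # Run the actual prediction.
    with torch.no_grad():
        return model(input_data)

def output_fn(prediction, accept):
    # Serialize the prediction back to the client as JSON.
    return json.dumps(prediction.tolist())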

ScriptRunConfig with datastore reference on AML

When trying to run a ScriptRunConfig, using:
src = ScriptRunConfig(source_directory=project_folder,
                      script='train.py',
                      arguments=['--input-data-dir', ds.as_mount(),
                                 '--reg', '0.99'],
                      run_config=run_config)
run = experiment.submit(config=src)
It doesn't work and breaks with this when I submit the job:
... lots of things... and then
TypeError: Object of type 'DataReference' is not JSON serializable
However, if I run it with the Estimator, it works. One of the differences is that a ScriptRunConfig takes its parameters as a list, whereas the Estimator takes a dictionary.
Thanks for any pointers!
Being able to use DataReference in ScriptRunConfig is a bit more involved than doing just ds.as_mount(). You will need to convert it into a string in arguments and then update the RunConfiguration's data_references section with the DataReferenceConfiguration created from ds. Please see here for an example notebook on how to do that.
If you are just reading from the input location and not doing any writes to it, please check out Dataset. It allows you to do exactly what you are doing without doing anything extra. Here is an example notebook that shows this in action.
Below is a short version of the notebook
from azureml.core import Dataset
# more imports and code

ds = Datastore(workspace, 'mydatastore')
dataset = Dataset.File.from_files(path=(ds, 'path/to/input-data/within-datastore'))

src = ScriptRunConfig(source_directory=project_folder,
                      script='train.py',
                      arguments=['--input-data-dir', dataset.as_named_input('input').as_mount(),
                                 '--reg', '0.99'],
                      run_config=run_config)
run = experiment.submit(config=src)
See how-to-migrate-from-estimators-to-scriptrunconfig in the official documentation.
The core code for using DataReference in ScriptRunConfig is:
# if you want to pass a DataReference object, such as the below:
datastore = ws.get_default_datastore()
data_ref = datastore.path('./foo').as_mount()

src = ScriptRunConfig(source_directory='.',
                      script='train.py',
                      arguments=['--data-folder', str(data_ref)],  # cast the DataReference object to str
                      compute_target=compute_target,
                      environment=pytorch_env)

# set a dict of the DataReference(s) you want on the `data_references` attribute
# of the ScriptRunConfig's underlying RunConfiguration object
src.run_config.data_references = {data_ref.data_reference_name: data_ref.to_config()}
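In both variants, train.py receives the mounted location as an ordinary command-line string. A minimal sketch of the consuming side (the --data-folder name matches the snippet above; the rest is an assumption):
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument('--data-folder', type=str, help='mounted path to the input data')
parser.add_argument('--reg', type=float, default=0.99)
args = parser.parse_args()

# the mounted datastore behaves like a local directory at runtime
print('reading data from', args.data_folder)
print(os.listdir(args.data_folder))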

Protocol problem with PyMC3 on Jupyter notebook

I am working with the following code, but I get an error
import numpy as np
import pymc3 as pm
import theano.tensor as tt

with pm.Model() as model:
    alpha = 1.0 / count_data.mean()  # Recall count_data is the
                                     # variable that holds our txt counts
    lambda_1 = pm.Exponential("lambda_1", alpha)
    lambda_2 = pm.Exponential("lambda_2", alpha)
    tau = pm.DiscreteUniform("tau", lower=0, upper=n_count_data - 1)

with model:
    idx = np.arange(n_count_data)  # Index
    lambda_ = pm.math.switch(tau > idx, lambda_1, lambda_2)

with model:
    observation = pm.Poisson("obs", lambda_, observed=count_data)

with model:
    step = pm.Metropolis()
    trace = pm.sample(10000, tune=5000, step=step)
But I get the error
ValueError: must use protocol 4 or greater to copy this object; since __getnewargs_ex__ returned keyword arguments.
I have Windows 10, Python 3.5.6, PyMC3 3.5, and IPython 6.5.0. Any help is deeply appreciated. Thanks in advance.
It sounds like this exception is being thrown by the joblib library, which uses pickle to send the model to different processes. The easiest fix is to use only a single core, by changing the last line to
trace = pm.sample(10000, tune=5000, step=step, cores=1, chains=4)
It will be hard to diagnose the problem with joblib without more details. Creating a fresh conda environment might help.
The workaround suggested by colcarroll did not work for me. The behavior you are seeing is related to PR#3140 of PyMC3, which you may want to track there. The solution and/or workaround may depend on how you are running theano (with or without GPU support).
