I have been using PyTorch for a while now and am making a general RL framework. I am running into the question of whether to use np.arrays or tensors.
When would you not want to use tensors when available? What would make you choose numpy over pytorch? Obviously tensors are important for ML models, but what if you want to just do basic image processing or list manipulation?
I am tempted to use Tensors whenever possible but do not know of any pitfalls. (graph confusion? memory leaks??)
For example, here is a basic, unfinished code snippet for collecting actions for an env; I am not sure whether to stick with numpy or not.
from dataclasses import dataclass

import gym
import numpy as np
import pytest
from gym import error


@dataclass
class Action(object):
    """
    Handles actions, action space, and value verification.
    """
    taken_action: np.ndarray
    raw_action: np.ndarray
    n_possible_values: int
    action_space: gym.Space

    def __post_init__(self):
        # Wrap scalar actions (e.g. from a Discrete space) in an array.
        if not isinstance(self.taken_action, np.ndarray):
            self.taken_action = np.array([self.taken_action])


@pytest.mark.parametrize("env", sorted([env.id for env in gym.envs.registry.all()]))
def test_action_data_structure(env):
    try:
        init_env = gym.make(env)
    except error.DependencyNotInstalled as e:
        print(e)
        return
    taken_action = init_env.action_space.sample()
    raw_action = np.random.rand(init_env.action_space.n)  # assumes a Discrete action space
    state, reward, done, info = init_env.step(taken_action)
    action_dataclass = Action(taken_action=taken_action, raw_action=raw_action,
                              n_possible_values=init_env.action_space.n,
                              action_space=init_env.action_space)
Well, the answer is: it depends. There is a tradeoff between speed and clarity when you debug your code. If your matrices are large, it can pay off to jump between numpy and PyTorch tensors, because the GPU will run the operations so much faster that it outweighs the cost of the conversion. It is hard to say exactly where that threshold (in terms of data size) lies, so I would benchmark a few simple operations both ways and then pick whichever method works best for you.
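For example, to get a feel for where that threshold is for your workload, a rough comparison along these lines could help (the sizes and the matmul are just illustrative; note that torch.from_numpy shares memory with the source array on CPU, so the conversion itself is cheap):
import time

import numpy as np
import torch

a = np.random.rand(4096, 4096).astype(np.float32)

# Pure numpy on the CPU.
start = time.time()
np_result = a @ a
print("numpy:", time.time() - start)

# Convert to a tensor (zero-copy on CPU), optionally move it to the GPU, compute, convert back.
start = time.time()
t = torch.from_numpy(a)
if torch.cuda.is_available():
    t = t.cuda()
torch_result = (t @ t).cpu().numpy()
print("torch:", time.time() - start)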
In addition, I suggest you read this answer and also this blog post.
I want to further improve the inference time of BERT.
Here is the code:
tokens = {'input_ids': [], 'attention_mask': []}
for sentence in list(data_dict.values()):
    new_tokens = tokenizer.encode_plus(sentence, max_length=512,
                                       truncation=True, padding='max_length',
                                       return_tensors='pt',
                                       return_attention_mask=True)
    tokens['input_ids'].append(new_tokens['input_ids'][0])
    tokens['attention_mask'].append(new_tokens['attention_mask'][0])

# reformat list of tensors into single tensor
tokens['input_ids'] = torch.stack(tokens['input_ids'])
tokens['attention_mask'] = torch.stack(tokens['attention_mask'])

outputs = model(**tokens)
embeddings = outputs[0]
Is there a way to provide batches (like in training) instead of the whole dataset?
There are several optimizations that we can do here, which are (mostly) natively supported by the Huggingface tokenizer.
TL;DR: an optimized version would be the one below; I have explained the ideas behind each change afterwards.
def chunker(seq, batch_size=16):
    return (seq[pos:pos + batch_size] for pos in range(0, len(seq), batch_size))

for sentence_batch in chunker(list(data_dict.values())):
    tokenized_sentences = tokenizer(sentence_batch, max_length=512,
                                    truncation=True, padding=True,
                                    return_tensors="pt", return_attention_mask=True)
    with torch.no_grad():
        outputs = model(**tokenized_sentences)
The first optimization is to batch together several samples at the same time. For this, it is helpful to have a closer look at the actual __call__ function of the tokenizer, see here (bold highlight by me):
text (str, List[str], List[List[str]]) – The sequence or batch of sequences to be encoded [...].
This means that it is enough to simply pass several samples at the same time, and we get the readily processed batch back. I want to personally note that it would in theory be possible to pass the entire list of samples at once, but there are also some drawbacks that we will go into later.
To actually pass a decently sized number of samples to the tokenizer, we need a function that can aggregate several samples from the dictionary (our batch-to-be) in a single iteration. I've used another Stackoverflow answer for this, see this post for several valid answers.
I've chosen the highest-voted answer, but do note that this creates an explicit copy, and might therefore not be the most memory-efficient solution. Then you can simply iterate over the batches, like so:
def chunker(seq, batch_size=16):
    return (seq[pos:pos + batch_size] for pos in range(0, len(seq), batch_size))

for sentence_batch in chunker(list(data_dict.values())):
    ...
The next optimization is in the way you call your tokenizer. Your code does this in several separate steps, which can be aggregated into a single call. For the sake of clarity, I also point out which of these arguments are not required in your call (this often improves your code readability).
tokenized_sentences = tokenizer(sentence_batch, max_length=512,
                                truncation=True, padding=True,
                                return_tensors="pt", return_attention_mask=True)
with torch.no_grad():  # Just to be sure
    outputs = model(**tokenized_sentences)
I want to comment on the use of some of the arguments as well:
max_length=512: This is only required if your value differs from the model's default max_length. For most models, this will otherwise default to 512.
return_attention_mask: Will also default to the model-specific values, and in most cases does not need to be set explicitly.
padding=True: If you noticed, this is different from your version, and arguably what gives you the biggest "out-of-the-box" speedup. With padding='max_length', every sequence is padded to 512 tokens, so you end up computing a lot of unnecessary tokens. For most real-world data I have seen, inputs tend to be much shorter, so you only need to pad up to the longest sequence in each batch, which is exactly what padding=True does. For actual (CPU inference) speedups, I have played around with some different sequence lengths myself, see my repository on Github. Noticeably, for the same CPU and different batch sizes, a 10x speedup is possible.
Edit: I've added the torch.no_grad() here, too, just in case somebody else wants to use this snippet. I generally recommend using it right before the piece of code that is actually affected by it, just so that nothing gets overlooked by accident.
Also, there are some more possible optimizations that require you to have a bit more insight into your data samples:
If the variance of sample lengths is quite drastic, you can get an even higher speedup if you sort your samples by length (ideally, tokenized length, but character length / word count will also give you an approximate idea). That way, when batching several samples together, you minimize the amount of padding that is required.
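A minimal sketch of that idea, reusing the chunker from above and using whitespace word count as a cheap proxy for tokenized length (the variable names are just illustrative):
sentences = list(data_dict.values())

# Sort by approximate length so that similar-length samples land in the same batch;
# with padding=True each batch is only padded up to its own longest sequence.
sentences_sorted = sorted(sentences, key=lambda s: len(s.split()))

for sentence_batch in chunker(sentences_sorted):
    tokenized_sentences = tokenizer(sentence_batch, truncation=True, padding=True,
                                    return_tensors="pt")
    with torch.no_grad():
        outputs = model(**tokenized_sentences)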
Maybe you might be interested in the Intel OpenVINO backend for inference execution on CPU? It's currently a work in progress, see https://github.com/huggingface/transformers/pull/14203
I had the same issue of slow inference time with BERT on the CPU. I started using HuggingFace Pipelines for inference, and the Trainer for training.
It's well documented on HuggingFace.
The pipeline makes it simple to perform inference on batches. In one pass you can run inference over the whole batch instead of looping over single texts.
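A minimal sketch of what that could look like for embedding extraction (the task name, model identifier, and batch_size here are assumptions on my side; adapt them to your model):
from transformers import pipeline

# Build a feature-extraction pipeline; device=-1 runs on CPU, device=0 on the first GPU.
extractor = pipeline("feature-extraction",
                     model="bert-base-uncased",
                     tokenizer="bert-base-uncased",
                     device=-1)

sentences = list(data_dict.values())

# Passing a list lets the pipeline batch the inputs internally.
embeddings = extractor(sentences, batch_size=16)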
I have an existing model where I load some pre-trained weights and then do prediction (one image at a time) in pytorch. I am trying to basically convert it to a pytorch lightning module and am confused about a few things.
So currently, my __init__ method for the model looks like this:
self._load_config_file(cfg_file)
# just creates the pytorch network
self.create_network()
self.load_weights(weights_file)
self.cuda(device=0) # assumes GPU and uses one. This is probably suboptimal
self.eval() # prediction mode
From what I can gather from the lightning docs, I can pretty much do the same, except without the cuda() call. So something like:
self.create_network()
self.load_weights(weights_file)
self.freeze() # prediction mode
So, my first question is: is this the correct way to use lightning? How would lightning know if it needs to use the GPU? I am guessing this needs to be specified somewhere.
Now, for the prediction, I have the following setup:
def infer(frame):
    img = transform(frame)  # apply some transformation to the input
    img = torch.from_numpy(img).float().unsqueeze(0).cuda(device=0)
    with torch.no_grad():
        output = self.__call__(Variable(img)).data.cpu().numpy()
    return output
This is the bit that has me confused. Which functions do I need to override to make a lightning compatible prediction?
Also, at the moment, the input comes as a numpy array. Is that something that would be possible from the lightning module or do things always have to use some sort of a dataloader?
At some point, I want to extend this model implementation to do training as well, so I want to make sure I do it right. While most examples focus on training models, a simple example of just doing prediction at production time on a single image/data point might be useful.
I am using pytorch-lightning 0.7.5 with pytorch 1.4.0 on a GPU with CUDA 10.1.
LightningModule is a subclass of torch.nn.Module so the same model class will work for both inference and training. For that reason, you should probably call the cuda() and eval() methods outside of __init__.
Since it's just a nn.Module under the hood, once you've loaded your weights you don't need to override any methods to perform inference, simply call the model instance. Here's a toy example you can use:
import torchvision.models as models
from pytorch_lightning.core import LightningModule

class MyModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.resnet = models.resnet18(pretrained=True, progress=False)

    def forward(self, x):
        return self.resnet(x)

model = MyModel().eval().cuda(device=0)
And then to actually run inference you don't need a method, just do something like:
for frame in video:
    img = transform(frame)
    img = torch.from_numpy(img).float().unsqueeze(0).cuda(0)
    output = model(img).data.cpu().numpy()
    # Do something with the output
The main benefit of PyTorchLightning is that you can also use the same class for training by implementing training_step(), configure_optimizers() and train_dataloader() on that class. You can find a simple example of that in the PyTorchLightning docs.
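A rough sketch of what that could look like for the toy model above; the loss, optimizer, and the my_dataset placeholder are my own assumptions, not part of the original setup:
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader

class MyTrainableModel(MyModel):
    def training_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        loss = F.cross_entropy(logits, y)
        # On very old Lightning versions (e.g. 0.7.x) you may need to return {'loss': loss}.
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

    def train_dataloader(self):
        # my_dataset is a placeholder for whatever labelled Dataset you use.
        return DataLoader(my_dataset, batch_size=32, shuffle=True)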
Even though the above answer suffices, if one takes note of the following line
img = torch.from_numpy(img).float().unsqueeze(0).cuda(0)
one has to put both the model and the image on the right GPU. On a multi-GPU inference machine, this becomes a hassle.
To solve this, .predict was also added recently; see more at https://pytorch-lightning.readthedocs.io/en/stable/deploy/production_basic.html
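A minimal sketch of that approach on a recent Lightning version (MyModel is the class from the answer above; my_dataset and the device settings are placeholders of mine):
import pytorch_lightning as pl
from torch.utils.data import DataLoader

model = MyModel()
pred_loader = DataLoader(my_dataset, batch_size=32)  # my_dataset is a placeholder

# The Trainer moves both the model and the batches to the right device(s) for you.
trainer = pl.Trainer(accelerator="gpu", devices=1)
predictions = trainer.predict(model, dataloaders=pred_loader)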
It looks like parameters and children show the same info, so what is the difference between them?
import torch
print('torch.__version__', torch.__version__)
m = torch.load('imagenet_resnet18.pth')
print(m.parameters)
print(m.children)
model.parameters() is a generator that returns tensors containing your model parameters.
model.children() is a generator that returns the layers of the model, from which you can extract your parameter tensors using <layername>.weight or <layername>.bias.
Visit this link for a simple tutorial on accessing and freezing model layers.
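As a small illustration (using a torchvision ResNet-18 as a stand-in for your loaded model, and remembering to call them as methods, parameters() and children(); printing m.parameters without parentheses only prints the bound method):
import torchvision.models as models

m = models.resnet18(pretrained=True)

# parameters(): yields the weight/bias tensors of all submodules, recursively.
for p in m.parameters():
    print(type(p), p.shape)  # <class 'torch.nn.parameter.Parameter'> torch.Size([64, 3, 7, 7])
    break

# children(): yields the immediate submodules (layers) of the model.
for c in m.children():
    print(type(c))  # <class 'torch.nn.modules.conv.Conv2d'>
    break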
The (at the time of writing, only) existing answer is not to the point, and thus misleading, in my opinion. According to the current docs (08/23/2022):
children():
Returns an iterator over immediate children modules.
This means that it will stop at non-leaf modules like torch.nn.Sequential, torch.nn.ModuleList, etc., and will not recurse into them.
parameters(recurse=True):
Returns an iterator over module parameters. This is typically passed to an optimizer.
"Passed to an optimizer" should imply that recursive cases are taken care of by the team. Just pass the return value/object to the optimizer.
Since I know you're lazy developers, you must read this answer from the PyTorch forum to see the output of children() on a real example: https://discuss.pytorch.org/t/module-children-vs-module-modules/4551/4?u=raining_day513
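A minimal sketch of the "immediate children" point, using a toy nested model (the layer sizes are arbitrary):
import torch.nn as nn

net = nn.Sequential(
    nn.Sequential(nn.Linear(4, 8), nn.ReLU()),
    nn.Linear(8, 2),
)

# children() stops at the immediate submodules: the inner Sequential and the outer Linear.
print([type(c).__name__ for c in net.children()])  # ['Sequential', 'Linear']

# parameters(recurse=True) walks the whole tree, so the optimizer sees every weight and bias.
print(len(list(net.parameters())))  # 4 tensors: 2 weights + 2 biases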
I have the following snippet running to train a model for text classification. I optimized it quite a bit and it's running pretty smoothly; however, it still uses a lot of RAM. Our dataset is huge (13 million documents + 18 million words in the vocabulary), but the point in the execution where the error is thrown is very weird, in my opinion. The script:
import joblib
import numpy
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
y = encoder.fit_transform(categories)
classes = list(range(0, len(encoder.classes_)))

vectorizer = CountVectorizer(vocabulary=vocabulary,
                             binary=True,
                             dtype=numpy.int8)

classifier = SGDClassifier(loss='modified_huber',
                           n_jobs=-1,
                           average=True,
                           random_state=1)

tokenpath = modelpath.joinpath("tokens")
for i in range(0, len(batches)):
    token_matrix = joblib.load(
        tokenpath.joinpath("{}.pickle".format(i)))
    batchsize = len(token_matrix)
    classifier.partial_fit(
        vectorizer.transform(token_matrix),
        y[i * batchsize:(i + 1) * batchsize],
        classes=classes
    )

joblib.dump(classifier, modelpath.joinpath('classifier.pickle'))
joblib.dump(vectorizer, modelpath.joinpath('vectorizer.pickle'))
joblib.dump(encoder, modelpath.joinpath('category_encoder.pickle'))
joblib.dump(options, modelpath.joinpath('extraction_options.pickle'))
I got the MemoryError at this line:
joblib.dump(vectorizer, modelpath.joinpath('vectorizer.pickle'))
At this point in the execution, training is finished and the classifier has already been dumped. It should be collected by the garbage collector if more memory is needed. In addition, why would joblib allocate so much memory if it isn't even compressing the data?
I do not have deep knowledge of the inner workings of the Python garbage collector. Should I be forcing gc.collect() or using 'del' statements to free those objects that are no longer needed?
Update:
I have tried using the HashingVectorizer and, even though it greatly reduces memory usage, the vectorizing is way slower, making it not a very good alternative.
I have to pickle the vectorizer so I can use it later in the classification process to generate the sparse matrix that is fed to the classifier. Here is my classification code:
extracted_features = joblib.Parallel(n_jobs=-1)(
    joblib.delayed(features.extractor)(d, extraction_options) for d in documents)

probabilities = classifier.predict_proba(
    vectorizer.transform(extracted_features))

predictions = category_encoder.inverse_transform(
    probabilities.argmax(axis=1))

trust = probabilities.max(axis=1)
If you are providing your custom vocabulary to the CountVectorizer, it should not be a problem to recreate it later on, during classification. As you provide a set of strings instead of a mapping, you probably want to use the parsed vocabulary, which you can access with:
parsed_vocabulary = vectorizer.vocabulary_
joblib.dump(parsed_vocabulary, modelpath.joinpath('vocabulary.pickle'))
and then load it and use to re-create the CountVectorizer:
vectorizer = CountVectorizer(
    vocabulary=parsed_vocabulary,
    binary=True,
    dtype=numpy.int8
)
Note that you do not need to use joblib here; the standard pickle should perform the same; you might get better results using any of the available alternatives, with PyTables being worth mentioning.
If that uses too much memory as well, you should try using the original vocabulary to recreate the vectorizer; currently, when provided with a set of strings as the vocabulary, vectorizers just convert the set to a sorted list, so you shouldn't need to worry about reproducibility (although I would double-check that before using it in production). Or you could just convert the set to a list on your own.
To sum up: because you do not fit() the Vectorizer, the whole added value of using CountVectorizer is its transform() method; since all the data it needs is the vocabulary (and the parameters), you can reduce memory consumption by pickling just your vocabulary, either processed or not.
As you asked for answer drawing from official sources, I would like to point you to: https://github.com/scikit-learn/scikit-learn/issues/3844 where an owner and a contributor of scikit-learn mention recreating a CountVectorizer, albeit for other purposes. You may have better luck reporting your problems in the linked repo, but make sure to include a dataset which causes excessive memory usage issues to make it reproducible.
And finally you may just use HashingVectorizer as mentioned earlier in a comment.
PS: regarding the use of gc.collect() - I would give it a go in this case; regarding the technical details, you will find many questions on SO tackling this issue.
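For example, in the training script a sketch of that could look like this (purely illustrative; whether it actually helps depends on what is still holding references to the large objects):
import gc

# Drop references to large objects as soon as they are persisted, before dumping the rest.
joblib.dump(classifier, modelpath.joinpath('classifier.pickle'))
del classifier
gc.collect()

joblib.dump(parsed_vocabulary, modelpath.joinpath('vocabulary.pickle'))
joblib.dump(encoder, modelpath.joinpath('category_encoder.pickle'))
joblib.dump(options, modelpath.joinpath('extraction_options.pickle'))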
As far as I understand, the strength of PyTorch is supposed to be that it works with dynamic computational graphs. In the context of NLP, that means that sequences with variable lengths do not necessarily need to be padded to the same length. But, if I want to use the PyTorch DataLoader, I need to pad my sequences anyway, because the DataLoader only takes tensors - given that I, as a total beginner, do not want to build a customized collate_fn.
Now this makes me wonder - doesn’t this wash away the whole advantage of dynamic computational graphs in this context?
Also, if I pad my sequences to feed it into the DataLoader as a tensor with many zeros as padding tokens at the end (in the case of word ids), will it have any negative effect on my training since PyTorch may not be optimized for computations with padded sequences (since the whole premise is that it can work with variable sequence lengths in the dynamic graphs), or does it simply not make any difference?
I will also post this question in the PyTorch Forum...
Thanks!
In the context of NLP, that means that sequences with variable lengths do not necessarily need to be padded to the same length.
This means that you don't need to pad sequences unless you are doing data batching which is currently the only way to add parallelism in PyTorch. DyNet has a method called autobatching (which is described in detail in this paper) that does batching on the graph operations instead of the data, so this might be what you want to look into.
But, if I want to use PyTorch DataLoader, I need to pad my sequences anyway because the DataLoader only takes tensors - given that me as a total beginner does not want to build some customized collate_fn.
You can use the DataLoader given you write your own Dataset class and you are using batch_size=1. The twist is to use numpy arrays for your variable length sequences (otherwise default_collate will give you a hard time):
import numpy as np
from torch.utils.data import Dataset
from torch.utils.data.dataloader import DataLoader

class FooDataset(Dataset):
    def __init__(self, data, target):
        assert len(data) == len(target)
        self.data = data
        self.target = target

    def __getitem__(self, index):
        return self.data[index], self.target[index]

    def __len__(self):
        return len(self.data)

data = [[1, 2, 3], [4, 5, 6, 7, 8]]
data = [np.array(n) for n in data]
targets = ['a', 'b']

ds = FooDataset(data, targets)
dl = DataLoader(ds, batch_size=1)

print(list(enumerate(dl)))
# [(0, [
#  1  2  3
# [torch.LongTensor of size 1x3]
# , ('a',)]), (1, [
#  4  5  6  7  8
# [torch.LongTensor of size 1x5]
# , ('b',)])]
Now this makes me wonder - doesn’t this wash away the whole advantage of dynamic computational graphs in this context?
Fair point, but the main strength of dynamic computational graphs is (at least currently) the possibility of using debugging tools like pdb, which rapidly decreases your development time. Debugging is way harder with static computation graphs. There is also no reason why PyTorch would not implement further just-in-time optimizations or a concept similar to DyNet's auto-batching in the future.
Also, if I pad my sequences to feed it into the DataLoader as a tensor with many zeros as padding tokens at the end [...], will it have any negative effect on my training [...]?
Yes, both in runtime and for the gradients. The RNN will iterate over the padding just like normal data which means that you have to deal with it in some way. PyTorch supplies you with tools for dealing with padded sequences and RNNs, namely pad_packed_sequence and pack_padded_sequence. These will let you ignore the padded elements during RNN execution, but beware: this does not work with RNNs that you implement yourself (or at least not if you don't add support for it manually).
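A minimal sketch of how those two functions are used around a built-in RNN (the sizes and the LSTM are just illustrative):
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# Two sequences of lengths 5 and 3, already padded to the batch maximum of 5.
padded = torch.randn(2, 5, 10)   # (batch, max_seq_len, features)
lengths = torch.tensor([5, 3])   # must be sorted descending when enforce_sorted=True

rnn = nn.LSTM(input_size=10, hidden_size=20, batch_first=True)

# Pack so the LSTM skips the padded positions...
packed = pack_padded_sequence(padded, lengths, batch_first=True, enforce_sorted=True)
packed_out, (h_n, c_n) = rnn(packed)

# ...and unpack back into a padded tensor if you need one downstream.
output, out_lengths = pad_packed_sequence(packed_out, batch_first=True)
print(output.shape, out_lengths)  # torch.Size([2, 5, 20]) tensor([5, 3])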