How to slice Kinetics400 training dataset? (pytorch)

I am trying to run the official script for video classification.
I want to tweak some functions, and running through all the examples would cost me too much time.
I wonder how I can slice the Kinetics training dataset based on that script.
This is the code I added before train_sampler = RandomClipSampler(dataset.video_clips, args.clips_per_video) in the script (let's say I just want to run 100 examples):
tr_split_len = 100
dataset = torch.utils.data.random_split(dataset, [tr_split_len, len(dataset)-tr_split_len])[0]
Then, when the script hits train_sampler = RandomClipSampler(dataset.video_clips, args.clips_per_video), it raises the error:
AttributeError: 'Subset' object has no attribute 'video_clips'
So the type of dataset changes from torchvision.datasets.kinetics.Kinetics400 to torch.utils.data.dataset.Subset.
I understand why that happens, but how can I slice the dataset properly (hopefully without resorting to a break inside the dataloader loop)?
Thanks.

It seems that torchvision.datasets.kinetics.Kinetics400 internally uses an object of class VideoClips to store the information about the clips. It is stored in the member variable Kinetics400().video_clips.
The VideoClips class has a method called subset that takes a list of indices and returns a new VideoClips object containing only the clips at the specified indices. You could then simply replace the old VideoClips object with the new one in your dataset.
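A minimal sketch of that idea, placed where the slicing code from the question was (the choice of the first 100 indices is only for illustration):

# Assumes `dataset` is the Kinetics400 instance built by the reference script.
tr_split_len = 100
indices = list(range(tr_split_len))
dataset.video_clips = dataset.video_clips.subset(indices)

# The sampler then only ever sees the reduced set of clips.
train_sampler = RandomClipSampler(dataset.video_clips, args.clips_per_video)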

Related

Huggingface token classification pipeline giving different outputs than just calling model() directly

I am trying to mask named entities in text, using a RoBERTa-based model.
The suggested way to use the model is via the Huggingface pipeline, but I find that it is rather slow to use it that way. Using a pipeline on text data also prevents me from using my GPU for computation, as the text cannot be put onto the GPU.
Because of this, I decided to put the model on the GPU, tokenize the text myself (using the same tokenizer I pass to the pipeline), put the tokens on the GPU and pass them to the model afterwards. This works, but the outputs of the model used directly like this, rather than via the pipeline, differ significantly.
I can't find a reason for this, nor a way to fix it.
I tried reading through the token classification pipeline source code but couldn't find a difference in my usage compared to what the pipeline does.
Examples of code which produce different results:
Suggested usage in the model card:
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

ner_tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large-finetuned-conll03-english")
model = AutoModelForTokenClassification.from_pretrained("xlm-roberta-large-finetuned-conll03-english")
classifier = pipeline("ner", model=model, tokenizer=ner_tokenizer, framework='pt')
out = classifier(dataset['text'])
'out' is now a list of lists of dictionaries, each holding information on a named entity found in the corresponding string of the list 'dataset['text']'.
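For illustration, a single entry of out[0] looks roughly like the following; the values are made up and the exact keys can vary between transformers versions and aggregation settings:

# A hypothetical single entity entry returned by the "ner" pipeline.
example_entity = {
    'entity': 'I-PER',   # predicted tag for this token
    'score': 0.9991,     # model confidence
    'index': 1,          # token index within the tokenized string
    'word': '▁John',     # the (sub)word piece
    'start': 0,          # character offsets in the original string
    'end': 4,
}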
My custom usage:
text_batch = dataset['text']
encodings_batch = ner_tokenizer(text_batch,padding="max_length", truncation=True, max_length=128, return_tensors="pt")
input_ids = encodings_batch['input_ids']
input_ids = input_ids.to(TORCH_DEVICE)
outputs = model(input_ids)[0]
outputs = outputs.to('cpu')
label_ner_ids = outputs.argmax(dim=2).to('cpu')
'label_ner_ids' is now a two-dimensional tensor whose elements represent the labels for each token in a given line of text, so label_ner_ids[i, j] is the label for the j-th token in the i-th string of the list 'text_batch'. The token labels here differ from the outputs of the pipeline usage.
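For comparison, here is a sketch of the same manual approach that also forwards the attention mask returned by the tokenizer and maps the predicted ids back to label strings. Whether the missing attention mask explains the discrepancy is an assumption on my part, not something established in the question; ner_tokenizer, model, text_batch and TORCH_DEVICE are as defined above.

import torch

# Assumption: with padding="max_length" but no attention_mask, the model also
# attends to padding tokens, which can change the logits relative to the pipeline.
encodings_batch = ner_tokenizer(text_batch, padding="max_length", truncation=True,
                                max_length=128, return_tensors="pt")
encodings_batch = {k: v.to(TORCH_DEVICE) for k, v in encodings_batch.items()}

with torch.no_grad():
    logits = model(**encodings_batch)[0]  # same indexing as in the question

label_ids = logits.argmax(dim=-1).cpu()
# model.config.id2label maps integer class ids to label strings such as 'I-PER'.
labels = [[model.config.id2label[int(i)] for i in row] for row in label_ids]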

Build a pytorch model that wraps around another pytorch model

Is it possible to wrap a pytorch model inside another pytorch module? I could not do it the normal way as in transfer learning (simply concatenating some more layers), because in order to get the intended value for the next 'layer', I need to wait for the last layer of the first module to generate multiple outputs (say 100) and use all of those outputs to get the value for the next 'layer' (say, by taking the max of those outputs). I tried to define the integrated model as something like the following:
class integrated(nn.Module):
    def __init__(self):
        super(integrated, self).__init__()

    def forward(self, x):
        device = torch.device('cpu')
        model = VAE(
            encoder_layer_sizes=args.encoder_layer_sizes,
            latent_size=args.latent_size,
            decoder_layer_sizes=args.decoder_layer_sizes,
            conditional=args.conditional,
            num_labels=10 if args.conditional else 0).to(device)
        model.load_state_dict(torch.load(r'...'))  # the first model is saved somewhere else beforehand
        model.eval()
        temp = []
        for j in range(100):
            x = model(x)
            temp.append(x)
        y = max(temp)
        return y
The reason I would like to do this is that the library I need to use requires its input to be a PyTorch module. Otherwise I could simply leave the last part outside of the module.
Yes you can definitely use a Pytorch module inside another Pytorch module. The way you are doing this in your example code is a bit unusual though, as external modules (VAE, in your case) are more often initialized in the __init__ function and then saved as attributes of the main module (integrated). Among other things, this avoids having to reload the sub-module every time you call forward.
One other thing that looks a bit funny is your for loop over repeated invocations of model(x). If there is no randomness involved in model's evaluation, then you would only need a single call to model(x), since all 100 calls will give the same value. So assuming there is some randomness, you should consider whether you can get the desired effect by batching together 100 copies of x and using a single call to model with this batched input. This ultimately depends on additional information about why you are calling this function multiple times on the same input, but either way, using a single batched evaluation will be a lot faster than using many unbatched evaluations.
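A minimal sketch of that restructuring, following the suggestions above. It assumes the VAE is built and its state_dict loaded once, outside forward, that its forward pass returns a single tensor, and that taking the element-wise max over the 100 samples is what the original max(temp) was meant to do:

import torch
import torch.nn as nn

class Integrated(nn.Module):
    def __init__(self, vae, n_samples=100):
        super().__init__()
        # The sub-module is stored once here instead of being reloaded in forward.
        self.vae = vae
        self.vae.eval()
        self.n_samples = n_samples

    def forward(self, x):
        # Assume x has shape [1, ...]; build a batch of n_samples copies so one
        # VAE call replaces the Python loop. This relies on the VAE sampling its
        # latent code independently for each batch element.
        x_rep = x.repeat(self.n_samples, *([1] * (x.dim() - 1)))
        with torch.no_grad():
            out = self.vae(x_rep)
        # Reduce over the sample dimension with an element-wise max.
        y, _ = out.max(dim=0, keepdim=True)
        return y

The VAE itself would be constructed and its checkpoint loaded exactly as in the question, and then passed in as the vae argument.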

I want to understand a Python prediction model

I am trying to code a predictive model, and I found this code somewhere and want to know what it means: "X_train.reset_index(inplace = True)".
I think it would help if you provided more context in your question. But in the meantime: the line of code you have shown re-indexes the training dataset of whatever model you're working with (usually X denotes the data and Y denotes the labels).
The dataset is a pandas DataFrame object, and the reset_index method replaces the DataFrame's index with a default integer index (0, 1, 2, ...); with inplace=True the DataFrame is modified in place instead of a new one being returned. You can find more information in the documentation for this method:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.reset_index.html
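A small self-contained example of what that call does, using a made-up frame with a string index:

import pandas as pd

# Hypothetical training frame with a non-default (string) index.
X_train = pd.DataFrame({'feature': [0.1, 0.2, 0.3]}, index=['a', 'b', 'c'])

X_train.reset_index(inplace=True)
# X_train now has a fresh 0..2 integer index, and the old index
# has been moved into a new column called 'index':
#    index  feature
# 0      a      0.1
# 1      b      0.2
# 2      c      0.3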

How to get batch size for `tf.PaddingFIFOQueue.dequeue_up_to` in case of fewer elements?

I'm referring to the example codes in
http://www.wildml.com/2016/08/rnns-in-tensorflow-a-practical-guide-and-undocumented-features/
and
https://indico.io/blog/tensorflow-data-inputs-part1-placeholders-protobufs-queues/
for standard approaches to feeding data into language models (RNNs) using TensorFlow.
Before feeding the input, I make sure the sequences are padded to the maximum input length in the current batch and then randomly shuffled within the current batch. So far so good.
However, I have difficulty in specifying the initial state for tf.nn.dynamic_rnn whose shape depends on the current batch_size.
If the batch size is fixed, there should not be any problem.
However, if I'm using tf.PaddingFIFOQueue.dequeue_up_to(batch_size), it may return fewer than batch_size elements when there are not enough elements left in the queue (which can happen when dequeuing the last set of elements).
In that case, how do we specify the initial state with the exact batch size returned?
You can use the None dimension in your Tensors to tell TensorFlow that the batch dimension can differ from one run to another; see the sketch after the quoted tips below.
You might want to read this FAQ on tensor shapes to get a better sense, and in particular this section (formatting is mine):
How do I build a graph that works with variable batch sizes?
It is often useful to build a graph that works with variable batch sizes, for example so that the same code can be used for (mini-)batch training, and single-instance inference. The resulting graph can be saved as a protocol buffer and imported into another program.
When building a variable-size graph, the most important thing to remember is not to encode the batch size as a Python constant, but instead to use a symbolic Tensor to represent it. The following tips may be useful:
Use batch_size = tf.shape(input)[0] to extract the batch dimension from a Tensor called input, and store it in a Tensor called batch_size.
Use tf.reduce_mean instead of tf.reduce_sum(...) / batch_size.
If you use placeholders for feeding input, you can specify a variable batch dimension by creating the placeholder with tf.placeholder(..., shape=[None, ...]). The None element of the shape corresponds to a variable-sized dimension.
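A minimal sketch of how this applies to the initial state of tf.nn.dynamic_rnn; the names inputs and num_units are assumptions standing in for whatever your pipeline produces, and the cell class location has moved between TF 1.x versions:

import tensorflow as tf  # TF 1.x style, matching the question

# `inputs` is the [batch, time, features] tensor coming out of
# tf.PaddingFIFOQueue.dequeue_up_to; the batch size is read symbolically,
# so partial final batches work too.
batch_size = tf.shape(inputs)[0]
cell = tf.nn.rnn_cell.BasicLSTMCell(num_units)
initial_state = cell.zero_state(batch_size, dtype=tf.float32)
outputs, final_state = tf.nn.dynamic_rnn(cell, inputs,
                                         initial_state=initial_state,
                                         dtype=tf.float32)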

Tensorflow: Setting an array element with a sequence

I'm trying to train a CNN using my own image dataset, but when passing the batch data and labels to the feed_dict I get the error ValueError: setting an array element with a sequence. From what I read here, this is a dimension issue, probably coming from my batch_label Tensor, but I couldn't figure out how to make it a one-hot Tensor (which is what my graph expects).
I uploaded the full code as a gist here: https://gist.github.com/guivn/f7f753547f77a3b12992
TL;DR: You can't feed a tf.Tensor object (viz. batch_data and batch_labels in your gist) as the value for another tensor. (I believe the error message should be clearer about this in more recent versions of TensorFlow.)
Unfortunately you can't currently use the feed/tf.placeholder() mechanism to pass the result of one TensorFlow graph to another. We are investigating ways to make this easier, since it is a common confusion and feature request. For your exact program, however, it should be easy to solve: simply move the lines that create the input tensors and use them in place of the placeholders. Your program will then look something like:
with graph.as_default():
    # Input data.
    filename_and_label_tensor = tf.train.string_input_producer(['train.txt'], shuffle=True)
    data, label = parse_csv(filename_and_label_tensor)
    tf_train_dataset, tf_train_labels = tf.train.batch([data, label], batch_size, num_threads=4)

    # Rest of the model construction goes here....
Typically, if you want to pass another dataset through the same model—e.g. for evaluation—it's easiest to make another copy of the graph (perhaps sharing the same tf.Variable objects).
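A minimal sketch of how the reorganized graph might then be driven (TF 1.x queue runners); optimizer, loss and num_steps are assumed names standing in for whatever your gist defines:

with tf.Session(graph=graph) as sess:
    sess.run(tf.global_variables_initializer())
    # string_input_producer and tf.train.batch are backed by queues,
    # so the queue runners must be started before training.
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    try:
        for step in range(num_steps):
            # No feed_dict needed: tf_train_dataset/tf_train_labels are
            # produced inside the graph by the input pipeline.
            _, loss_value = sess.run([optimizer, loss])
    finally:
        coord.request_stop()
        coord.join(threads)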
