Tokenizer can add padding without error, but data collator cannot - pytorch

I'm trying to fine tune a GPT2-based model on my data using the run_clm.py example script from HuggingFace.
I have a .json data file that looks like this:
...
{"text": "some text"}
{"text": "more text"}
...
I had to change the default behavior of the script that used to concatenate input text, because all my examples are separate demonstrations that should not be concatenated:
def add_labels(example):
    example['labels'] = example['input_ids'].copy()
    return example

with training_args.main_process_first(desc="grouping texts together"):
    lm_datasets = tokenized_datasets.map(
        add_labels,
        batched=False,
        # batch_size=1,
        num_proc=data_args.preprocessing_num_workers,
        load_from_cache_file=not data_args.overwrite_cache,
        desc=f"Grouping texts in chunks of {block_size}",
    )
This essentially only adds the appropriate 'labels' field required by CLM.
However since GPT2 has a 1024-sized context-window, the examples should be padded to that length.
I can achieve this by modifying the tokenization procedure like this:
def tokenize_function(examples):
    with CaptureLogger(tok_logger) as cl:
        output = tokenizer(
            examples[text_column_name], padding='max_length')  # added: padding='max_length'
    # ...
The training runs correctly.
However, I believe this should not be done by the tokenizer, but by the data collator instead. When I remove padding='max_length' from the tokenizer, I get the following error:
ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (`labels` in this case) have excessive nesting (inputs type `list` where type `int` is expected).
And also, above that:
Traceback (most recent call last):
  File "/home/jan/repos/text2task/.venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 716, in convert_to_tensors
    tensor = as_tensor(value)
ValueError: expected sequence of length 9 at dim 1 (got 33)
During handling of the above exception, another exception occurred:
To fix this, I have created a data collator that should do the padding:
data_collator = DataCollatorWithPadding(tokenizer, padding='max_length')
This is what is passed to the trainer. However, the above error remains.
What's going on?

I managed to fix the error but I'm really unsure about my solution, details below. Will accept a better answer.
This seems to solve it:
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, padding=True)
Found in the documentation
It seems like DataCollatorWithPadding doesn't pad the labels?
My problem is about generating an output sequence from an input sequence, so I'm guessing that using DataCollatorForSeq2Seq is what I actually want to do. However, my data does not have separate input and target columns, but a single text column (that contains a string input => target). I'm not really sure that this collator is intended to be used for GPT2...
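The error message names `labels` as the ragged field, which fits the guess above: the input IDs get padded, but the labels stay at their original lengths. Here is a minimal sketch, in plain Python lists, of what a label-aware collator has to do. The function name `pad_causal_lm_batch` and the `pad_token_id=0` value are hypothetical placeholders, not part of any library. The value -100 is the index that PyTorch's cross-entropy loss ignores by default, so padded label positions do not contribute to the loss.

```python
def pad_causal_lm_batch(features, pad_token_id, label_pad_id=-100):
    """Pad a list of {'input_ids': [...], 'labels': [...]} dicts to a common length.

    Hypothetical helper for illustration only; a real collator would also
    convert the padded lists to tensors.
    """
    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_len - len(f["input_ids"])
        # Inputs are padded with the tokenizer's pad token.
        batch["input_ids"].append(f["input_ids"] + [pad_token_id] * n_pad)
        # The attention mask marks real tokens with 1 and padding with 0.
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
        # Labels are padded with the ignore index, not with the pad token.
        batch["labels"].append(f["labels"] + [label_pad_id] * n_pad)
    return batch


features = [
    {"input_ids": [5, 6, 7], "labels": [5, 6, 7]},
    {"input_ids": [8, 9], "labels": [8, 9]},
]
batch = pad_causal_lm_batch(features, pad_token_id=0)
```

After this step every field in the batch is rectangular, so converting it to tensors no longer fails on sequences of different lengths.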

Related

Is it possible to get dataset file information when testing a model?

My dataset code looks like the following, where X_test is a list[list] and y_test is a list[Path].
The first.py file
self.test_dataset = LongDataset(
    X_path=X_test,
    y_path=y_test,
    transform=val_transforms,
)
The rest is as usual (dataloader)
def test_dataloader(self):
    return DataLoader(self.test_dataset, batch_size=1, num_workers=8)
In the second.py file
The DataModule
data_module = DataModuleLong(batch_size=3,)
The Trainer
trainer = Trainer(gpus=1)
trainer.test(
    model=model,
    ckpt_path=ckpt_path,
    datamodule=data_module,
)
The test_step() in the third.py file
def test_step(self, batch, batch_idx: int):
    inputs, targets = batch
    logits = self(inputs)
    ...
    ...
    ...
Now, is it possible to print (in the test_step()) the filename (or the full path) of the (inputs, targets) that I am sending from test_dataset as (X_path, y_path)?
Essentially, what you want to do is get the index of each batch element in the batch returned by the dataloader object. From there it is trivial to index the dataset to get the desired data elements (in this case, file paths).
Now the short answer is that there is no directly implemented way to return this data using the dataloader. However, there are a few workarounds:
Pass your own BatchSampler or Sampler object to the DataLoader constructor. Unfortunately, there's no simple way to query the Sampler for the current batch, because it relies on generators (yielding the next sample discards the current one and loads the next). This is the same reason why you can't directly access the batch indices of the DataLoader. So to use this method, you'd have to pass a sampler for which you know a priori which indices will be returned on the i-th query. Not an ideal solution.
Create a custom dataset object. This is actually extremely easy to do: simply inherit from torch.utils.data.Dataset and implement the __init__, __len__ and __getitem__ methods. The __getitem__ method takes an index (let's say idx) as input and returns the dataset item at that index. You can essentially copy the code for the existing LongDataset line for line, but simply append idx to the values returned from the __getitem__ method. I would demonstrate, but you don't indicate where the LongDataset code comes from.
def __getitem__(self, idx):
    ...  # load files, preprocess, etc.
    return inputs, targets, idx
The dataloader's default collation will then batch the idx values along with the data, so in test_step you can simply replace the existing unpacking line with the following (this assumes the dataset is reachable from the module and keeps its paths in a list such as file_paths):
inputs, targets, indices = batch
data_paths = [self.test_dataset.file_paths[idx] for idx in indices]
The second solution is by far preferable, as it is more transparent and easier to understand.
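Since the LongDataset code isn't shown, here is a minimal sketch of the second approach with made-up names. `IndexedPathDataset`, the `.nii` filenames and the string-based "loading" are all hypothetical stand-ins. The class relies only on `__len__` and `__getitem__`, so it runs without PyTorch installed; in real use it would subclass torch.utils.data.Dataset and load the actual files.

```python
class IndexedPathDataset:
    """Map-style dataset sketch that returns (inputs, target, idx) per item."""

    def __init__(self, x_paths, y_paths):
        self.x_paths = x_paths  # list of lists of input paths
        self.y_paths = y_paths  # list of target paths

    def __len__(self):
        return len(self.y_paths)

    def __getitem__(self, idx):
        # Placeholder "loading": real code would read and transform the files.
        inputs = [f"loaded:{p}" for p in self.x_paths[idx]]
        target = f"loaded:{self.y_paths[idx]}"
        # Returning idx lets the caller map the sample back to its paths.
        return inputs, target, idx


dataset = IndexedPathDataset(
    x_paths=[["a0.nii", "a1.nii"], ["b0.nii", "b1.nii"]],
    y_paths=["a_mask.nii", "b_mask.nii"],
)
inputs, target, idx = dataset[1]
source_paths = (dataset.x_paths[idx], dataset.y_paths[idx])
```

An alternative with the same effect would be to return the path strings themselves from `__getitem__`, since the default collation can also batch strings.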

Customise train_step in model.fit() "OperatorNotAllowedInGraphError: iterating over `tf.Tensor` is not allowed: AutoGraph did convert this function"

I am trying to write a custom train_step to use in the tf.keras.Model.fit() function. I am following the TensorFlow tutorial. From what I understand, the input argument data of the train_step function is supposed to be the training dataset that I pass to Model.fit(). My dataset is a TFRecordDataset. It provides three features: the image, the labels and the box. So, in the train_step function, I first try to get the img, label and box parameters from the data argument that is passed in.
def train_step(self, data):
    print("printing data fed to train_step")
    print(data)
    img, label, gt_boxes = data
    if self.DEBUG:
        if(img == None):
            print("img input in train step is none")
    with tf.GradientTape() as tape:
        rpn_classification, rpn_regression = self(img, training=True)
        self.tf_rpn_target_generation_layer(gt_boxes, rpn_regression)
        loss = self.rpn_loss_function(rpn_classification)
    trainable_vars = self.trainable_variables
    gradients = tape.gradient(loss, trainable_vars)
    self.optimizer.apply_gradients(zip(gradients, trainable_vars))
    loss_tracker.update_state(loss)
    #mae_metric.update_state()
    return [loss_tracker]
The above is the code I use for my custom train_step function. When I run the fit, I get the following error
OperatorNotAllowedInGraphError: iterating over tf.Tensor is not allowed: AutoGraph did convert this function. This might indicate you are trying to use an unsupported feature.
I have used shuffle, cache, and repeat operations on my training dataset. Can anyone please help me understand why exactly this error appears?
In my previous experience, I generally create an iterator for the dataset and then use the get_next operation to obtain the features.
Edit:
I have tried the following, without success.
Since the data being sent into train_step is a dataset object, I used the tf.raw_ops.IteratorGetNext method to access the elements of the iterator, which returned an error saying
"TypeError: Input 'iterator' of 'IteratorGetNext' Op has type string that does not match the expected type of resource."
To fix this error, I assumed that TensorFlow was probably returning a graph-mode iterator whose elements could not be accessed, so I added the run_eagerly=True argument to model.compile(). That only printed the following output, followed by the same error:
Epoch 1/5
printing data fed to train_step
Tensor("Shape:0", shape=(0,), dtype=int32)
Tensor("IteratorGetNext:0", shape=(), dtype=string)
I have found the solution. The data that is being passed to my step function is an iterator, and hence I have to use the tf.raw_ops.IteratorGetNext method to access the contents of the iterator.
When doing this I initially got another error saying that the iterator type does not match the expected type of resource. When I debugged carefully, I understood that the read_tfrecords mapping that I had to apply to the dataset had not succeeded, which led to the dataset still containing unmapped TFRecords of type tf.string, which is not the resource type that train_step expects.

code error in word2vec program for DNA sequence

I have been trying to write code that reads nucleotide sequences in FASTA format as strings (each input as one word) and then uses already known binding-site sequences (11 bp long) to search among the nucleotide sequences with a word2vec model.
The FASTA file looks like the following, and all values are read into sequences as strings:
sequences:
ATCGTGACGTGACGTGACGT
CGTAGCTAGAGCTAGCGGATCGA
and the binding sites are stored as a column in dataframe as df['binding']
ATGACTCAGCA
GTGACTAAGCA
ATGACTCAGCA
ATGACTCAGCA
...
Here is my code in python:
import gensim
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
model = gensim.models.Word2Vec(sequences, size=2, min_count=len(sequences), sg = 1)
model.train(sequences,total_examples=len(sequences),epochs=10)
w1 = df['binding']
model.wv.most_similar(positive=w1)
I was hoping to get a relation between the binding sites, but it throws the error KeyError: "word 'ATGACTCAGCA' not in vocabulary". Here ATGACTCAGCA is the first value in df['binding'].
If I change w1 = df['binding'] to w1='A', I get these results:
[('T', 0.9952122569084167),
('G', 0.9772425889968872),
('C', 0.9460670351982117)]
What should I change to get a relation between two binding sites, and not between two or more base pairs?
You need to be sure your sequences object is a Python sequence where each item is a list of tokens, and where the tokens are the 'words' you want to look up (such as multiple related 11-character 'binding sites'). If it's a sequence of plain strings made up of just the characters A, G, T and C, each string is iterated character by character, so the tokens will just be A, G, T, C.
A size=2 probably won't generate interesting vectors, at least not for a vocabulary of hundreds or thousands of tokens.
A min_count as large as your full set of examples will throw away any token that doesn't appear at least that many times.
You don't need to call train() if you supplied the dataset to the class initialization: it will have already launched training automatically. (If you're running with logging at the INFO level, this would be obvious from the output.)
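One way to get 11-character tokens is to split each nucleotide string into overlapping 11-mers before training. This is an assumption about what suits the data rather than the only option, and the helper name `kmers` is made up for illustration. A minimal sketch in plain Python, using the two example sequences from the question:

```python
def kmers(sequence, k=11):
    """Return all overlapping substrings of length k, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


sequences = ["ATCGTGACGTGACGTGACGT", "CGTAGCTAGAGCTAGCGGATCGA"]

# Each sequence becomes one "sentence": a list of 11-character tokens.
corpus = [kmers(seq) for seq in sequences]

# The vocabulary a word2vec model would see is the set of distinct 11-mers.
vocabulary = {token for sentence in corpus for token in sentence}

# corpus could then be passed to gensim's Word2Vec in place of the raw
# strings, with a small min_count (e.g. 1) so that rare 11-mers are kept.
```

A binding site can only be looked up if that exact 11-mer occurs in the training sequences often enough to pass min_count. Note also that recent gensim releases (4.x) call the dimensionality parameter vector_size rather than size.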

How to include multiple input tensor in keras.model.fit_generator

I am a Keras rookie and, after struggling with this problem for many days, I need some help. Please ask for further information if anything is ambiguous.
Currently, I am trying to modify the code from a link. According to their network model, 2 input tensors are expected. I am having trouble feeding 2 input tensors into the source code they provide.
The function Boneage_prediction_model() creates a model with 2 input tensors.
def Boneage_prediction_model():
    i1 = Input(shape=(500, 500, 1), name='input_img')  # the 1st input tensor
    i2 = Input(shape=(1,), name='input_gender')  # the 2nd input tensor
    ... ...
    model = Model(inputs=(i1, i2), outputs=o)  # define model input with both i1 and i2
    ... ...

# using model.fit_generator to instantiate
# datagen is initiated by keras.preprocessing.image.ImageDataGenerator
# img_train is the 1st network input, and boneage_train is the training label
# gender_train is the 2nd network input
model.fit_generator(
    (datagen.flow(img_train, boneage_train, batch_size=10),
     gender_train),
    ... ...
)
I tried many ways to combine the two (datagen.flow(img_train, boneage_train, batch_size=10) and gender_train) as stated above, but it failed and kept reporting errors such as the following:
ValueError: Error when checking model input: the list of Numpy arrays that you are passing to your model is not the size the model expected. Expected to see 2 array(s), but instead got the following list of 1 arrays: [array([[[[-0.26078433],
[-0.26078433],
[-0.26078433],
...,
[-0.26078433],
[-0.26078433],
[-0.26078433]],
[[-0.26078433],
[-0.26...
If I understand you correctly, you want to have two inputs for one network and have one label for the combined output. In the official documentation for the fit_generator there is an example with multiple inputs.
Using a dictionary to map the multiple inputs would result in:
model.fit_generator(
    datagen.flow({'input_img':img_train, 'input_gender':gender_train}, boneage_train, batch_size=10),
    ...
)
After failing both with blindly combining the 2 inputs and with using a dictionary to map the multiple inputs, as another contributor suggested, I realized that the problem seems to lie in datagen.flow, which keeps me from combining an image tensor input with a categorical tensor input. datagen.flow comes from keras.preprocessing.image.ImageDataGenerator, whose goal is to preprocess the input images. Therefore, chances are that it is inappropriate to combine the 2 inputs inside datagen.flow. Additionally, fit_generator seems to expect a generator as input, so what I proposed in my question is wrong, though I do not fully understand the mechanism of this function.
Looking carefully through other code written by the team, I learned that I need to write a generator to combine the two. The solution is as follows:
from itertools import cycle  # needed for cycle() below

def combined_generators(image_generator, gender_data, batch_size):
    # Loop over the gender values in batches, forever
    gender_generator = cycle(batch(gender_data, batch_size))
    while True:
        nextImage = next(image_generator)
        nextGender = next(gender_generator)
        assert len(nextImage[0]) == len(nextGender)
        # Yield ([image_batch, gender_batch], label_batch)
        yield [nextImage[0], nextGender], nextImage[1]

def batch(iterable, n=1):
    # Split a sliceable collection into consecutive chunks of size n
    l = len(iterable)
    for ndx in range(0, l, n):
        yield iterable[ndx:min(ndx + n, l)]

train_gen_wrapper = combined_generators(train_gen_boneage, train_df_boneage['male'], BATCH_SIZE_TRAIN)
model.fit_generator(train_gen_wrapper, ... )

Seq2seq for non-sentence, float data; stuck configuring the decoder

I am trying to apply sequence-to-sequence modelling to EEG data. The encoding works just fine, but getting the decoding to work is proving problematic. The input-data has the shape None-by-3000-by-31, where the second dimension is the sequence-length.
The encoder looks like this:
initial_state = lstm_sequence_encoder.zero_state(batchsize, dtype=self.model_precision)
encoder_output, state = dynamic_rnn(
    cell=LSTMCell(32),
    inputs=lstm_input,  # shape=(None,3000,32)
    initial_state=initial_state,  # zeroes
    dtype=lstm_input.dtype  # tf.float32
)
I use the final state of the RNN as the initial state of the decoder. For training, I use the TrainingHelper:
training_helper = TrainingHelper(target_input, [self.sequence_length])
training_decoder = BasicDecoder(
    cell=lstm_sequence_decoder,
    helper=training_helper,
    initial_state=thought_vector
)
output, _, _ = dynamic_decode(
    decoder=training_decoder,
    maximum_iterations=3000
)
My troubles start when I try to implement inference. Since I am using non-sentence data, I do not need to tokenize or embed, because the data is essentially embedded already. The InferenceHelper class seemed the best way to achieve my goal. So this is what I use. I'll give my code then explain my problem.
def _sample_fn(decoder_outputs):
    return decoder_outputs

def _end_fn(_):
    return tf.tile([False], [self.lstm_layersize])  # Batch-size is sequence-length because of time major

inference_helper = InferenceHelper(
    sample_fn=_sample_fn,
    sample_shape=[32],
    sample_dtype=target_input.dtype,
    start_inputs=tf.zeros(batchsize_placeholder, 32),  # the batchsize varies
    end_fn=_end_fn
)
inference_decoder = BasicDecoder(
    cell=lstm_sequence_decoder,
    helper=inference_helper,
    initial_state=thought_vector
)
output, _, _ = dynamic_decode(
    decoder=inference_decoder,
    maximum_iterations=3000
)
The Problem
I don't know what the shape of the inputs should be. I know the start-inputs should be zero because it is the first time-step. But this throws errors; it expects the input to be (1,32).
I also thought I should pass the output of each time-step unchanged to the next. However, this raises problems at run-time: the batch-size varies, so the shape is partial. The library throws an exception at this as it tries to convert the start_input to a tensor:
...
self._start_inputs = ops.convert_to_tensor(
    start_inputs, name='start_inputs')
Any ideas?
This is a lesson in poor documentation.
I fixed my problem, but failed to address the variable batch-size problem.
The _end_fn was causing problems I was unaware of. I also managed to work out what the appropriate arguments are for the InferenceHelper. I've annotated them below in case anyone needs guidance in future:
def _end_fn(_):
    return tf.tile([False], [batchsize])

inference_helper = InferenceHelper(
    sample_fn=_sample_fn,
    sample_shape=[lstm_number_of_units],  # In my case, 32
    sample_dtype=tf.float32,  # Depends on the data
    start_inputs=tf.zeros((batchsize, lstm_number_of_units)),
    end_fn=_end_fn
)
As for the batch-size problem, there are two things I'm considering:
Changing the internal state of my model object. My TensorFlow computation graph is built inside a class. A class-field records the batch-size. Changing this during training may work. Or:
Pad the batches so that they are 200 sequences long. This will waste time.
Preferably I'd like a way to dynamically manage the batch-sizes.
EDIT: I found a way. It involves simply substituting square-brackets for parentheses:
inference_helper = InferenceHelper(
    sample_fn=_sample_fn,
    sample_shape=[self.lstm_layersize],
    sample_dtype=target_input.dtype,
    start_inputs=tf.zeros([batchsize, self.lstm_layersize]),
    end_fn=_end_fn
)
