Is it possible to get dataset file information at the time of testing a model? - python-3.x

My dataset code looks like the one below; here, X_test is a list[list] and y_test is a list[Path].
The first.py file:
self.test_dataset = LongDataset(
    X_path=X_test,
    y_path=y_test,
    transform=val_transforms,
)
The rest is as usual (the dataloader):
def test_dataloader(self):
    return DataLoader(self.test_dataset, batch_size=1, num_workers=8)
In the second.py file:
The DataModule:
data_module = DataModuleLong(batch_size=3,)
The Trainer:
trainer = Trainer(gpus=1)
trainer.test(
    model=model,
    ckpt_path=ckpt_path,
    datamodule=data_module,
)
The test_step() in the third.py file:
def test_step(self, batch, batch_idx: int):
    inputs, targets = batch
    logits = self(inputs)
    ...
Now, is it possible to print (in the test_step()) the filenames (or full paths) of the (inputs, targets) I am sending from test_dataset as (X_path, y_path)?

Essentially, what you want is the index of each element in the batch returned by the dataloader; from there it is trivial to index the dataset and get the desired data (in this case, the file paths).
Now the short answer is that there is no directly implemented way to return this data using the dataloader. However, there are a few workarounds:
Pass your own BatchSampler or Sampler object to the DataLoader constructor. Unfortunately, there is no simple way to query the Sampler for the current batch, because it relies on generators (yielding the next sample consumes it and loads the next one); this is the same reason you can't directly access the batch indices of the DataLoader. So to use this method, you'd have to pass a sampler for which you know a priori which indices will be returned on the i-th query. Not an ideal solution.
Create a custom dataset object. This is actually extremely easy to do: simply inherit from torch.utils.data.Dataset and implement the __init__, __len__ and __getitem__ methods. The __getitem__ method takes an index (say idx) as input and returns the sample at that index. You can essentially copy the existing LongDataset code line for line, but append idx to the values returned from __getitem__. I would demonstrate, but you don't indicate where the LongDataset code comes from.
def __getitem__(self, idx):
    ...  # load files, preprocess, etc.
    return data, idx
The dataloader will automatically collate the idx values for each batch element, so you can simply replace the existing line with:
inputs, targets, indices = batch
data_paths = [self.test_dataset.file_paths[idx] for idx in indices]
The second solution is by far preferable, as it is more transparent and easier to understand.
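As an illustration only (the real LongDataset is not shown in the question), a minimal sketch of such an index-returning dataset and the matching test_step might look like the following, assuming for the sake of the example that each entry of X_path/y_path can be loaded with numpy:

import numpy as np
import torch
from torch.utils.data import Dataset

class LongDatasetWithIndex(Dataset):
    """Same idea as LongDataset, but __getitem__ also returns the sample index."""

    def __init__(self, X_path, y_path, transform=None):
        self.X_path = X_path
        self.y_path = y_path
        self.transform = transform

    def __len__(self):
        return len(self.X_path)

    def __getitem__(self, idx):
        # Hypothetical loading logic; replace with whatever LongDataset actually does.
        x = torch.from_numpy(np.load(self.X_path[idx]))
        y = torch.from_numpy(np.load(self.y_path[idx]))
        if self.transform is not None:
            x = self.transform(x)
        return x, y, idx  # idx lets test_step map back to the file paths

# Inside the LightningModule, the collated indices map back to the paths:
def test_step(self, batch, batch_idx: int):
    inputs, targets, indices = batch
    input_paths = [self.test_dataset.X_path[i] for i in indices.tolist()]
    target_paths = [self.test_dataset.y_path[i] for i in indices.tolist()]
    print(input_paths, target_paths)
    logits = self(inputs)
    ...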

Related

Customise train_step in model.fit() "OperatorNotAllowedInGraphError: iterating over `tf.Tensor` is not allowed: AutoGraph did convert this function"

I am trying to write a custom train_step to use in the tf.keras.Model.fit() function, following the TensorFlow tutorial. From what I understand, in the train_step function the input argument data is supposed to be the training dataset that I pass to Model.fit(). My dataset is a TFRecordDataset, and it yields three features: the image, the labels, and the box. So in the train_step function I first try to get the img, label and box parameters from the data argument that is passed in.
def train_step(self, data):
    print("printing data fed to train_step")
    print(data)
    img, label, gt_boxes = data
    if self.DEBUG:
        if(img == None):
            print("img input in train step is none")
    with tf.GradientTape() as tape:
        rpn_classification, rpn_regression = self(img, training=True)
        self.tf_rpn_target_generation_layer(gt_boxes, rpn_regression)
        loss = self.rpn_loss_function(rpn_classification)
    trainable_vars = self.trainable_variables
    gradients = tape.gradient(loss, trainable_vars)
    self.optimizer.apply_gradients(zip(gradients, trainable_vars))
    loss_tracker.update_state(loss)
    # mae_metric.update_state()
    return [loss_tracker]
The above is the code I use for my custom train_step function. When I run fit, I get the following error:
OperatorNotAllowedInGraphError: iterating over tf.Tensor is not allowed: AutoGraph did convert this function. This might indicate you are trying to use an unsupported feature.
I have used shuffle, cache, and repeat operations on my training dataset. Can anyone please help me understand why exactly this error appears?
From my previous experience, I generally create an iterator for the dataset followed by a get_next operation to obtain the features.
Edit:
I have tried the following procedures, but they did not yield any outcome.
Since the data being sent into train_step is a dataset object, I used the tf.raw_ops.IteratorGetNext method to access the elements of the iterator, which returned an error saying:
"TypeError: Input 'iterator' of 'IteratorGetNext' Op has type string that does not match the expected type of resource."
To fix this error, I assumed that TensorFlow was likely returning the iterator graph and hence I was unable to access the elements, so I added the run_eagerly=True argument to model.compile(), which resulted in gibberish being printed and the same error:
Epoch 1/5
printing data fed to train_step
Tensor("Shape:0", shape=(0,), dtype=int32)
Tensor("IteratorGetNext:0", shape=(), dtype=string)
I have found the solution. The data being passed to my step function is an iterator, so I have to use the tf.raw_ops.IteratorGetNext method to access its contents.
When doing this I initially got another error saying that the iterator type does not match the expected type of resource. Debugging carefully, I understood that the read_tfrecords mapping I had applied to the dataset was unsuccessful, which left the dataset still containing unmapped TFRecords of type tf.string, which is not an expected type of resource for train_step.
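For reference, the kind of mapping this answer refers to looks roughly like the sketch below; the feature spec, decode calls, and the filenames variable are assumptions about how the TFRecords were written, not the actual read_tfrecords code. Once this map actually runs, each dataset element is a decoded (img, label, gt_boxes) tuple rather than a raw tf.string, and train_step can unpack it directly:

import tensorflow as tf

# Hypothetical feature spec; the real keys/shapes depend on how the records were written.
feature_spec = {
    "image": tf.io.FixedLenFeature([], tf.string),
    "label": tf.io.FixedLenFeature([], tf.int64),
    "boxes": tf.io.VarLenFeature(tf.float32),
}

def read_tfrecords(serialized_example):
    parsed = tf.io.parse_single_example(serialized_example, feature_spec)
    img = tf.io.decode_jpeg(parsed["image"], channels=3)   # assumed JPEG-encoded images
    label = parsed["label"]
    gt_boxes = tf.sparse.to_dense(parsed["boxes"])
    return img, label, gt_boxes

dataset = tf.data.TFRecordDataset(filenames)   # `filenames` is assumed to exist
dataset = dataset.map(read_tfrecords)          # without this map, elements stay tf.string
dataset = dataset.shuffle(1024).cache().repeat().batch(1)
model.fit(dataset, epochs=5)                   # each batch unpacks as (img, label, gt_boxes)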

MultiOutput Classification with TensorFlow Extended (TFX)

I'm quite new to TFX (TensorFlow Extended), and have been going through the sample tutorial on the TensorFlow portal to understand it a bit more and apply it to my dataset.
In my scenario, instead of predicting a single label, the problem at hand requires me to predict 2 outputs (category 1, category 2).
I've done this using pure TensorFlow Keras Functional API and that works fine, but then am now looking to see if that can be fitted into the TFX pipeline.
Where I get the error is at the Trainer stage of the pipeline; it is thrown in the _input_fn, and I suspect it's because I'm not correctly splitting the given data into a (features, labels) tensor pair in the pipeline.
Scenario:
Each row of the input data comes in the form of
[Col1, Col2, Col3, ClassificationA, ClassificationB]
ClassificationA and ClassificationB are the categorical labels which I'm trying to predict using the Keras functional model.
The output layer of the Keras functional model looks like the below, where there are 2 outputs joined to a single dense layer (note: the _xf appended to the end just illustrates that I've encoded the classes to int representations):
output_1 = tf.keras.layers.Dense(
    TargetA_Class, activation='sigmoid',
    name='ClassificationA_xf')(dense)
output_2 = tf.keras.layers.Dense(
    TargetB_Class, activation='sigmoid',
    name='ClassificationB_xf')(dense)
model = tf.keras.Model(inputs=inputs,
                       outputs=[output_1, output_2])
In the trainer module file, I've imported the required packages at the start:
import tensorflow_transform as tft
from tfx.components.tuner.component import TunerFnResult
import tensorflow as tf
from typing import List, Text
from tfx.components.trainer.executor import TrainerFnArgs
from tfx.components.trainer.fn_args_utils import DataAccessor, FnArgs
from tfx_bsl.tfxio import dataset_options
The current _input_fn in the trainer module file looks like the below (following the tutorial):
def _input_fn(file_pattern: List[Text],
              data_accessor: DataAccessor,
              tf_transform_output: tft.TFTransformOutput,
              batch_size: int = 200) -> tf.data.Dataset:
    """Helper function that Generates features and label dataset for tuning/training.

    Args:
      file_pattern: List of paths or patterns of input tfrecord files.
      data_accessor: DataAccessor for converting input to RecordBatch.
      tf_transform_output: A TFTransformOutput.
      batch_size: representing the number of consecutive elements of returned
        dataset to combine in a single batch
    Returns:
      A dataset that contains (features, indices) tuple where features is a
      dictionary of Tensors, and indices is a single Tensor of label indices.
    """
    return data_accessor.tf_dataset_factory(
        file_pattern,
        dataset_options.TensorFlowDatasetOptions(
            batch_size=batch_size,
            #label_key=[_transformed_name(x) for x in _CATEGORICAL_LABEL_KEYS]),
            label_key=_transformed_name(_CATEGORICAL_LABEL_KEYS[0]), _transformed_name(_CATEGORICAL_LABEL_KEYS[1])),
        tf_transform_output.transformed_metadata.schema)
When I run the trainer component, the error that comes up is:
label_key=_transformed_name(_CATEGORICAL_LABEL_KEYS[0]), _transformed_name(_CATEGORICAL_LABEL_KEYS[1])),
^ SyntaxError: positional argument follows keyword argument
I've also tried label_key=[_transformed_name(x) for x in _CATEGORICAL_LABEL_KEYS], which also gives an error.
However, if I just pass in a single label key, label_key=_transformed_name(_CATEGORICAL_LABEL_KEYS[0]), then it works fine.
FYI - _CATEGORICAL_LABEL_KEYS is nothing but a list which contains the names of the 2 outputs I'm trying to predict (ClassificationA, ClassificationB).
_transformed_name is just a function that returns an updated name/key for the transformed data:
def _transformed_name(key):
    return key + '_xf'
Question:
From what I can see, the label_key argument of dataset_options.TensorFlowDatasetOptions only accepts a single label name (string), which means it may not be able to output a dataset with multiple labels.
Is there a way I can modify _input_fn so that the dataset it returns carries the 2 output labels? So the tensors that are returned look something like:
Feature_Tensor: {Col1_xf: Col1_transformedfeature_values,
                 Col2_xf: Col2_transformedfeature_values,
                 Col3_xf: Col3_transformedfeature_values}
Label_Tensor: {ClassificationA_xf: ClassA_encodedlabels,
               ClassificationB_xf: ClassB_encodedlabels}
Would appreciate advice from the wider TFX community!
Since the label key is optional, instead of specifying it in TensorFlowDatasetOptions you can use dataset.map afterwards and pass both labels after taking them from your dataset.
Haven't tested it, but something like:
def _data_augmentation(feature_dict):
    features = {_transformed_name(x): feature_dict[_transformed_name(x)]
                for x in _CATEGORICAL_FEATURE_KEYS}
    labels = {_transformed_name(x): feature_dict[_transformed_name(x)]
              for x in _CATEGORICAL_LABEL_KEYS}
    return features, labels
def _input_fn(file_pattern: List[Text],
              data_accessor: DataAccessor,
              tf_transform_output: tft.TFTransformOutput,
              batch_size: int = 200) -> tf.data.Dataset:
    """Helper function that generates features and labels dataset for tuning/training.

    Args:
      file_pattern: List of paths or patterns of input tfrecord files.
      data_accessor: DataAccessor for converting input to RecordBatch.
      tf_transform_output: A TFTransformOutput.
      batch_size: representing the number of consecutive elements of returned
        dataset to combine in a single batch
    Returns:
      A dataset that contains (features, labels) tuples, where both are
      dictionaries of Tensors.
    """
    dataset = data_accessor.tf_dataset_factory(
        file_pattern,
        dataset_options.TensorFlowDatasetOptions(batch_size=batch_size),
        tf_transform_output.transformed_metadata.schema)
    dataset = dataset.map(_data_augmentation)
    return dataset
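With a mapping like the one above, the two label tensors arrive keyed by their transformed names, so the two-output Keras model can be compiled with per-output losses keyed by the same output-layer names. A rough, untested sketch; the loss choices are placeholders, and the fn_args usage assumes a run_fn like the one in the TFX tutorial (fn_args carrying train_files, data_accessor and train_steps):

# Keys must match the Dense output layer names ('ClassificationA_xf', 'ClassificationB_xf').
model.compile(
    optimizer='adam',
    loss={
        'ClassificationA_xf': 'sparse_categorical_crossentropy',  # placeholder losses
        'ClassificationB_xf': 'sparse_categorical_crossentropy',
    },
    loss_weights={'ClassificationA_xf': 1.0, 'ClassificationB_xf': 1.0})

# Inside run_fn(fn_args), assuming fn_args provides train_files / data_accessor / train_steps:
train_dataset = _input_fn(fn_args.train_files, fn_args.data_accessor,
                          tf_transform_output, batch_size=200)
model.fit(train_dataset, steps_per_epoch=fn_args.train_steps)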

How to include multiple input tensor in keras.model.fit_generator

I am a Keras rookie and I need some help after struggling with this problem for many days. Please ask for further information if there is any ambiguity.
Currently, I am trying to modify the code from a link. According to their network model, 2 input tensors are expected. I am having trouble including the 2 input tensors in the source code they provide.
The function Boneage_prediction_model() initializes a model with 2 input tensors:
def Boneage_prediction_model():
    i1 = Input(shape=(500, 500, 1), name='input_img')  # the 1st input tensor
    i2 = Input(shape=(1,), name='input_gender')        # the 2nd input tensor
    ... ...
    model = Model(inputs=(i1, i2), outputs=o)  # define model with both i1 and i2 as inputs
    ... ...

# using model.fit_generator to instantiate
# datagen is initiated by keras.preprocessing.image.ImageDataGenerator
# img_train is the 1st network input, and boneage_train is the training label
# gender_train is the 2nd network input
model.fit_generator(
    (datagen.flow(img_train, boneage_train, batch_size=10),
     gender_train),
    ... ...
)
I tried many ways to combine the two (datagen.flow(img_train, boneage_train, batch_size=10) and gender_train) as stated above, but it failed and kept reporting errors such as the following:
ValueError: Error when checking model input: the list of Numpy arrays that you are passing to your model is not the size the model expected. Expected to see 2 array(s), but instead got the following list of 1 arrays: [array([[[[-0.26078433],
[-0.26078433],
[-0.26078433],
...,
[-0.26078433],
[-0.26078433],
[-0.26078433]],
[[-0.26078433],
[-0.26...
If I understand you correctly, you want to have two inputs for one network and have one label for the combined output. In the official documentation for the fit_generator there is an example with multiple inputs.
Using a dictionary to map the multiple inputs would result in:
model.fit_generator(
    datagen.flow({'input_img': img_train, 'input_gender': gender_train}, boneage_train, batch_size=10),
    ...
)
After failing either to blindly combine the 2 inputs, or, as another contributor suggested, to use a dictionary to map the multiple inputs, I realized the problem seems to be datagen.flow, which keeps me from combining an image tensor input and a categorical tensor input. datagen.flow is initiated by keras.preprocessing.image.ImageDataGenerator with the goal of preprocessing the input images, so chances are it is inappropriate to combine the 2 inputs inside datagen.flow. Additionally, fit_generator seems to expect an input of generator type, so what I did as proposed in my question is wrong, though I do not fully understand the mechanism of this function.
Looking carefully through other code written by the team, I learned that I need to write a generator that combines the two. The solution is as follows:
from itertools import cycle  # cycle is needed to repeatedly iterate over the gender batches

def combined_generators(image_generator, gender_data, batch_size):
    gender_generator = cycle(batch(gender_data, batch_size))
    while True:
        nextImage = next(image_generator)
        nextGender = next(gender_generator)
        assert len(nextImage[0]) == len(nextGender)
        yield [nextImage[0], nextGender], nextImage[1]

def batch(iterable, n=1):
    l = len(iterable)
    for ndx in range(0, l, n):
        yield iterable[ndx:min(ndx + n, l)]

train_gen_wrapper = combined_generators(train_gen_boneage, train_df_boneage['male'], BATCH_SIZE_TRAIN)
model.fit_generator(train_gen_wrapper, ... )

How to create my own scoring in GridSearchCV?

I want to create my own scoring in GridSearchCV; below is my code.
When I run this code, the error happens at the last line: grid_x.fit(train_x_pca, x_ref). When I use a built-in scoring such as 'r2' (grid_x=GridSearchCV(nnw_model, para_grid, scoring='r2')), it works.
There must be something wrong in my own scoring definition.
nnw_model = MLPRegressor(hidden_layer_sizes=(15,), activation='tanh',
                         solver='lbfgs', learning_rate='adaptive', max_iter=1000,
                         learning_rate_init=0.01, alpha=0.01)

para_grid = [{'activation': ['tanh', 'logistic', 'relu'],
              'hidden_layer_sizes': [(15,), (17,), (19,), (21,)],
              'learning_rate_init': [0.01, 0.001, 0.0001]}]

x_ref = ocd_ref['tilt_x']

def Rsq_x_cal(train_x_pca, x_ref):
    nnw_model_x.fit(train_x_pca, x_ref)
    train_x_out = nnw_model_x.predict(train_x_pca)
    metric_x = linregress(train_x_out, x_ref)
    rsq_x = metric_x[2]**2
    return rsq_x

rsq_x_value = make_scorer(Rsq_x_cal, greater_is_better=True)
grid_x = GridSearchCV(nnw_model, para_grid, scoring=rsq_x_value)
grid_x.fit(train_x_pca, x_ref)
scoring takes a callable with actual outputs and predicted outputs as the parameters and returns a single number.
Something like this:
def my_scoring(y_actual, y_predicted):
    score = do_something_on_input()
    return score
In your custom method, you are passing the features (train_x_pca) and labels (x_ref) as input params. That's the source of the error. GridSearchCV will not pass the training data to your method; it will pass already-predicted data to it. So these two lines are unnecessary:
nnw_model_x.fit(train_x_pca, x_ref)
train_x_out=nnw_model_x.predict(train_x_pca)
...
Just do this:
# Notice that I changed the order of x_ref here from yours
# and deleted those two unwanted lines
def Rsq_x_cal(x_ref, train_x_out):
    metric_x = linregress(train_x_out, x_ref)
    rsq_x = metric_x[2]**2
    return rsq_x
Now, assuming that linregress returns the score between actual and predicted data, the above code should work.
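Putting the pieces together, a minimal, untested sketch of the corrected wiring might look like the following; the linregress-based R² and the names follow the question, while the model and grid here are trimmed-down placeholders:

from scipy.stats import linregress
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPRegressor

def Rsq_x_cal(y_actual, y_predicted):
    # GridSearchCV passes true and predicted values; no fitting or predicting here.
    metric_x = linregress(y_predicted, y_actual)
    return metric_x[2] ** 2  # squared correlation coefficient

rsq_x_value = make_scorer(Rsq_x_cal, greater_is_better=True)

nnw_model = MLPRegressor(max_iter=1000)               # trimmed-down placeholder model
para_grid = [{'hidden_layer_sizes': [(15,), (17,)]}]  # trimmed-down placeholder grid
grid_x = GridSearchCV(nnw_model, para_grid, scoring=rsq_x_value)
# grid_x.fit(train_x_pca, x_ref)  # train_x_pca / x_ref come from the question's own data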

How to provide an extra target argument to input_fn of tf.estimator

As you know, in order to use tf.estimator, one needs to implement the model function and build an input pipeline that yields batches of (features, labels) pairs; therefore the model function signature should be as follows:
model_fn(features, labels, mode, params, config):
These features and labels are returned from the input_fn. Assume that features -> X and labels -> y; I am having a problem here because I have two types of labels (targets, labels):
Features = X : [None, 2048]
Labels = targets: [None, 2048]
labels: [None, 1]
In order to provide targets and labels as separate arguments instead of just one label argument, what would be the alternative?
Note: I tried to concatenate targets and labels and then slice them where needed, but that created an additional problem during execution of the model. Therefore I am wondering whether you have any better ideas.
Thank you.
In your input_fn, you can simply return a dictionary instead of a tensor as labels. That is, your input function likely returns an iterator over a tuple (features, labels). Both features and labels can either be a single tensor or a dict. This dict should map from strings to tensors.
You can prepare the dataset as one returning three elements (features, targets, labels), and then include a mapping to pack the targets into a dict (there might be better ways but this works):
data = ...  # prepare dataset of 3-tuples (features, targets, labels)

def pack_in_dict(features, targets, labels):
    return features, {"targets": targets, "labels": labels}

data = data.map(pack_in_dict)
Now, if one of the elements is a dict (say, labels), then the corresponding input to model_fn will also be a dict. You can then simply use labels["targets"] and labels["labels"] in your model_fn.
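For completeness, a rough model_fn skeleton showing how the packed dict from pack_in_dict() arrives; the network and loss below are placeholders/assumptions, the point is only the label unpacking:

import tensorflow as tf

def model_fn(features, labels, mode, params, config):
    # `labels` arrives as the dict built in pack_in_dict above.
    targets = labels["targets"]  # shape [None, 2048]
    label = labels["labels"]     # shape [None, 1]; could feed an additional loss/metric term

    predictions = tf.keras.layers.Dense(2048)(features)      # placeholder network over X
    loss = tf.reduce_mean(tf.square(predictions - targets))  # placeholder loss using `targets`

    if mode == tf.estimator.ModeKeys.TRAIN:
        optimizer = tf.compat.v1.train.AdamOptimizer()
        train_op = optimizer.minimize(
            loss, global_step=tf.compat.v1.train.get_global_step())
        return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)
    return tf.estimator.EstimatorSpec(mode, loss=loss)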
