How does keras.utils.Sequence work? - python-3.x

I am trying to create a data pipeline for U-Net for image segmentation. I came across the keras.utils.Sequence class, with which I can create a data pipeline, but I am unable to understand how it works.
Links for the code: Keras code, Source code
def __iter__(self):
    """Create a generator that iterate over the Sequence."""
    for item in (self[i] for i in range(len(self))):
        yield item
I would highly appreciate it if anyone could tell me how this works.

You don't need a generator; the Sequence class is there to manage that. You need to define a class that inherits from tensorflow.keras.utils.Sequence and define the methods
__init__, __getitem__, and __len__. In addition, you can define the method on_epoch_end, which is called at the end of each epoch and is usually used to shuffle the sample indexes.
There is an example in the link you gave (Tensorflow Sequence).
Below is another example of Sequence.
Note that you can pass the data to the __init__ constructor, but you may as well read the data from files in the __getitem__ method, assuming you know where to read it, e.g. by passing the name of a directory or directories into the constructor. This is necessary if there is a lot of data.
from tensorflow import keras
import numpy as np


class SequenceExample(keras.utils.Sequence):
    def __init__(self, x_in, y_in, batch_size, shuffle=True):
        # Initialization
        self.batch_size = batch_size
        self.shuffle = shuffle
        self.x = x_in
        self.y = y_in
        self.datalen = len(y_in)
        self.indexes = np.arange(self.datalen)
        if self.shuffle:
            np.random.shuffle(self.indexes)

    def __getitem__(self, index):
        # get batch indexes from shuffled indexes
        batch_indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]
        x_batch = self.x[batch_indexes]
        y_batch = self.y[batch_indexes]
        return x_batch, y_batch

    def __len__(self):
        # Denotes the number of batches per epoch
        return self.datalen // self.batch_size

    def on_epoch_end(self):
        # Updates indexes after each epoch
        self.indexes = np.arange(self.datalen)
        if self.shuffle:
            np.random.shuffle(self.indexes)
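For completeness, here is a minimal usage sketch showing how such a Sequence is passed directly to model.fit, which then calls __len__ and __getitem__ every epoch. The toy data and tiny model below are illustrative assumptions, not part of the original answer.

import numpy as np
from tensorflow import keras

# Toy data and a tiny model, purely to show how the Sequence plugs into fit().
x_train = np.random.rand(100, 8).astype("float32")
y_train = np.random.rand(100, 1).astype("float32")

model = keras.Sequential([keras.layers.Dense(1, input_shape=(8,))])
model.compile(optimizer="adam", loss="mse")

train_seq = SequenceExample(x_train, y_train, batch_size=16)
model.fit(train_seq, epochs=2)  # on_epoch_end reshuffles the indexes between epochs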


Changing the checkpoint path of lr_find

I want to tune the learning rate for my PyTorch Lightning model. My code runs on a GPU cluster, so I can only write to certain folders that I bind mount. However, trainer.tuner.lr_find tries to write the checkpoint to the folder where my script runs and since this folder is not writable, it fails with the following error:
OSError: [Errno 30] Read-only file system: '/opt/xrPose/.lr_find_43df1c5c-0aed-4205-ac56-2fe4523ca4a7.ckpt'
Is there any way to change the checkpoint path for lr_find? I checked the documentation, but I couldn't find any information on that in the part related to checkpointing.
My code is below:
res = trainer.tuner.lr_find(model, train_dataloaders=train_dataloader, val_dataloaders=val_dataloader, min_lr=1e-5)
logging.info(f"suggested learning rate: {res.suggestion()}")
model.hparams.learning_rate = res.suggestion()
You may need to specify default_root_dir when initializing the Trainer:
trainer = Trainer(default_root_dir='./my_dir')
Description from the Official Documentation:
default_root_dir - Default path for logs and weights when no logger or
pytorch_lightning.callbacks.ModelCheckpoint callback passed.
Code example:
import numpy as np
import torch
from pytorch_lightning import LightningModule, Trainer
from torch.utils.data import DataLoader, Dataset


class MyDataset(Dataset):
    def __init__(self) -> None:
        super().__init__()

    def __getitem__(self, index):
        x = np.zeros((10,), np.float32)
        y = np.zeros((1,), np.float32)
        return x, y

    def __len__(self):
        return 100


class MyModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.model = torch.nn.Linear(10, 1)

    def forward(self, x):
        return self.model(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.MSELoss()(self(x), y)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=0.02)


model = MyModel()
trainer = Trainer(default_root_dir='./my_dir')
train_dataloader = DataLoader(MyDataset())
trainer.tuner.lr_find(model, train_dataloader)
This is because the checkpoint path is defined in lr_finder.py as:
# Save initial model, that is loaded after learning rate is found
ckpt_path = os.path.join(trainer.default_root_dir, f".lr_find_{uuid.uuid4()}.ckpt")
trainer.save_checkpoint(ckpt_path)
The only way of changing the directory for saving the checkpoint is to change default_root_dir. But be aware that this is also the directory to which the Lightning logs are saved.
You can easily change it with trainer = Trainer(default_root_dir='./NAME_OF_THE_DIR').

torch - subsample each dataset differently and concatenate them

I have two datasets, but one is larger than the other, and I want to subsample the larger one (resampling it in each epoch).
I probably cannot use the DataLoader's sampler argument, because I would be passing the already concatenated dataset to the DataLoader.
How do I achieve this simply?
I think one solution would be to write a class SubsampledDataset(IterableDataset) which would resample every time __iter__ is called (each epoch).
(Or better use a map-style dataset, but is there a hook that gets called every epoch, like __iter__ gets?)
This is what I have so far (untested). Usage:
dataset1: Any = ...

# subsample original_dataset2, so that it is equally large in each epoch
dataset2 = RandomSampledDataset(original_dataset2, num_samples=len(dataset1))

concat_dataset = ConcatDataset([dataset1, dataset2])

data_loader = torch.utils.data.DataLoader(
    concat_dataset,
    sampler=RandomSamplerWithNewEpochHook(dataset2.new_epoch_hook, concat_dataset)
)
The result is that the concat_dataset is shuffled each epoch (RandomSampler); in addition, the dataset2 component is a new sample of the (possibly larger) original_dataset2, different in each epoch.
You can subsample more datasets by replacing
sampler=RandomSamplerWithNewEpochHook(dataset2.new_epoch_hook, ...
with a hook that calls each dataset's new_epoch_hook, e.g.:
sampler=RandomSamplerWithNewEpochHook(lambda: (dataset2.new_epoch_hook(), dataset3.new_epoch_hook(), dataset4.new_epoch_hook()), ...
Code:
from typing import Callable, Optional, Sized

import torch
from torch.utils.data import Dataset, RandomSampler


class RandomSamplerWithNewEpochHook(RandomSampler):
    """Wraps torch.utils.data.RandomSampler and calls the supplied new_epoch_hook before each epoch."""

    def __init__(self, new_epoch_hook: Callable, data_source: Sized, replacement: bool = False,
                 num_samples: Optional[int] = None, generator=None):
        super().__init__(data_source, replacement, num_samples, generator)
        self.new_epoch_hook = new_epoch_hook

    def __iter__(self):
        self.new_epoch_hook()
        return super().__iter__()


class RandomSampledDataset(Dataset):
    """Subsamples a dataset. The sample is different in each epoch.

    This helps when concatenating datasets, as the subsampling rate can be different for each dataset.
    Call new_epoch_hook before each epoch. (This can be done using e.g. RandomSamplerWithNewEpochHook.)
    This would arguably be harder to achieve with a concatenated dataset and a sampler argument to the
    DataLoader: the sampler would have to be aware of the indices of the subdatasets' items in the
    concatenated dataset, and of the subsampling for each subdataset."""

    def __init__(self, dataset, num_samples, transform=lambda im: im):
        self.dataset = dataset
        self.transform = transform
        self.num_samples = num_samples
        # subsample without replacement; passing num_samples with replacement=False
        # requires a reasonably recent PyTorch version
        self.sampler = RandomSampler(dataset, num_samples=num_samples)
        self.current_epoch_samples = None

    def new_epoch_hook(self):
        # materialize this epoch's indices (a list, not an iterator, so torch.tensor can consume it)
        self.current_epoch_samples = torch.tensor(list(self.sampler), dtype=torch.int)

    def __len__(self):
        return self.num_samples

    def __getitem__(self, item):
        if item < 0 or item >= len(self):
            raise IndexError
        img = self.dataset[self.current_epoch_samples[item].item()]
        return self.transform(img)
You can stop the iteration by raising StopIteration. This exception is caught by the DataLoader, which simply stops the iteration. So you can do something like this:
import torch
from torch.utils.data import Dataset, DataLoader


class SubDataset(Dataset):
    """SubDataset class."""

    def __init__(self, dataset, length):
        self.dataset = dataset
        self.elem = 0
        self.length = length

    def __getitem__(self, index):
        self.elem += 1
        if self.elem > self.length:
            self.elem = 0
            raise StopIteration  # caught by DataLoader
        return self.dataset[index]

    def __len__(self):
        return len(self.dataset)


if __name__ == '__main__':
    torch.manual_seed(0)
    dataloader = DataLoader(SubDataset(torch.arange(10), 5), shuffle=True)
    for _ in range(3):
        for x in dataloader:
            print(x)
    print(len(dataloader))  # 10!!
Output:
Note that setting __len__ to self.length will cause a problem, because the DataLoader will only use indices between 0 and length-1 (which is not what you want). Unfortunately, I found no way to set the actual length without getting this behaviour (due to a DataLoader restriction). Thus be careful: len(dataset) is the original length and dataset.length is the new length.

Visualize the output of a VGG16 model with a t-SNE plot?

I need to visualize the output of a VGG16 model that classifies 14 different classes.
I loaded the trained model and replaced the classifier layer with an Identity() layer, but it doesn't categorize the output.
Here is the snippet (the number of samples here is 1000 images):
import numpy
import torch
import torch.nn as nn
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

epoch = 800
PATH = 'vgg16_epoch{}.pth'.format(epoch)
checkpoint = torch.load(PATH)
model.load_state_dict(checkpoint['model_state_dict'])
optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
epoch = checkpoint['epoch']


class Identity(nn.Module):
    def __init__(self):
        super(Identity, self).__init__()

    def forward(self, x):
        return x


model.classifier._modules['6'] = Identity()
model.eval()

logits_list = numpy.empty((0, 4096))
targets = []

with torch.no_grad():
    for step, (t_image, target, classess, image_path) in enumerate(test_loader):
        t_image = t_image.cuda()
        target = target.cuda()
        target = target.data.cpu().numpy()
        targets.append(target)
        logits = model(t_image)
        print(logits.shape)
        logits = logits.data.cpu().numpy()
        print(logits.shape)
        logits_list = numpy.append(logits_list, logits, axis=0)
        print(logits_list.shape)

tsne = TSNE(n_components=2, verbose=1, perplexity=10, n_iter=1000)
tsne_results = tsne.fit_transform(logits_list)

target_ids = range(len(targets))
plt.scatter(tsne_results[:, 0], tsne_results[:, 1], c=target_ids, cmap=plt.cm.get_cmap("jet", 14))
plt.colorbar(ticks=range(14))
plt.legend()
plt.show()
Here is what this script produced: I am not sure why I have all the colors in each cluster!
VGG16 outputs over 25k features to the classifier. I believe that's too many for t-SNE. It's a good idea to include a new nn.Linear layer to reduce this number, so t-SNE may work better. In addition, I'd recommend two different ways to get the features from the model:
The best way to get them, regardless of the model, is by using the register_forward_hook method. You may find a notebook here with an example.
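As a minimal sketch of that approach (this is not the linked notebook; the layer index and dummy input below are assumptions), a forward hook on the second Linear layer of torchvision's VGG16 classifier captures the 4096-dimensional features:

import torch
import torchvision

features = []

def hook(module, inputs, output):
    # store the layer's output for later t-SNE
    features.append(output.detach().cpu())

vgg = torchvision.models.vgg16().eval()
# classifier[3] is the second Linear layer; its output has 4096 features
handle = vgg.classifier[3].register_forward_hook(hook)

with torch.no_grad():
    vgg(torch.randn(2, 3, 224, 224))  # dummy batch just to trigger the hook

handle.remove()
feats = torch.cat(features)  # shape (2, 4096), ready for t-SNE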
If you don't want to use the hook, I'd suggest the following. After loading your model, you may use the class below to extract the features:
import torch.nn as nn


class FeatNet(nn.Module):
    def __init__(self, vgg):
        super(FeatNet, self).__init__()
        self.features = nn.Sequential(*list(vgg.children())[:-1])

    def forward(self, img):
        return self.features(img)
Now you just need to create feat_net = FeatNet(vgg) and call feat_net(img) to get the features.
To include the feature reducer, as I suggested before, you need to retrain your model doing something like:
import torch
import torch.nn as nn


class FeatNet(nn.Module):
    def __init__(self, vgg):
        super(FeatNet, self).__init__()
        self.features = nn.Sequential(*list(vgg.children())[:-1])
        self.feat_reducer = nn.Sequential(
            nn.Linear(25088, 1024),
            nn.BatchNorm1d(1024),
            nn.ReLU()
        )
        self.classifier = nn.Linear(1024, 14)

    def forward(self, img):
        x = self.features(img)
        x = torch.flatten(x, 1)  # flatten (N, 512, 7, 7) to (N, 25088) before the linear layer
        x_r = self.feat_reducer(x)
        return self.classifier(x_r)
Then you can run your model and return x_r, that is, the reduced features. As I said, 25k features are too many for t-SNE. Another way to reduce this number is to use PCA instead of the nn.Linear layer: you feed the 25k features to PCA and then train t-SNE on the PCA's output. I prefer using nn.Linear, but you need to test both to see which one gives you the better result.
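A minimal sketch of the PCA route (assuming logits_list holds the extracted high-dimensional features, as in the question's snippet):

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# reduce the raw features to 50 dimensions first, then embed with t-SNE
pca_features = PCA(n_components=50).fit_transform(logits_list)
tsne_results = TSNE(n_components=2, perplexity=10).fit_transform(pca_features)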

Pass user-specified parameters to DataLoader

I am using U-Net and implementing the weighting technique described in the papers from 2015 (U-Net: Convolutional Networks for Biomedical Image Segmentation) and 2019 (U-Net – Deep Learning for Cell Counting, Detection, and Morphometry). In that technique there is a variance σ and a weight w_0. I would like σ, especially, to be a learnable parameter instead of guessing which value is best from dataset to dataset.
From what I found, I can do this using nn.Parameter.
To use the learned σ from epoch to epoch, I need to somehow pass this new value to the __getitem__ function of the Dataset through the DataLoader.
My current take on this is to extend torch.utils.data.DataLoader, where the new __init__ has an extra parameter accepting the user-specified/learnable parameters. However, looking at the source code of torch.utils.data.DataLoader, I do not understand where and how the DataLoader calls the Dataset instance, and hence how to pass these parameters.
Code-wise, in the Dataset definition there is the function
def __getitem__(self, index):
which I could change to
def __getitem__(self, index, sigma):
to make use of the updated, newly learned σ.
My problem is that during training, I iterate through the training dataset as
for epoch in range(checkpoint['epoch'], num_epochs):
    ....
    for ii, (X, y, y_weight, fname) in enumerate(dataLoader[phase]):
In that enumeration of the DataLoader, how can I pass the new σ to the DataLoader such that the DataLoader passes it to the Dataset's __getitem__ function mentioned above?
EDIT
Currently, I define a parameter sigma inside the Dataset class:
class MedicalImageDataset(Dataset):
    def __init__(self, fname, img_transform=None, mask_transform=None, weight_transform=None, sigma=8):
        ...
        self.sigma = sigma

    def __getitem__(self, index):
        sigma = self.sigma
        ...
which I update through the DataLoader as
dataLoader[ 'train'].dataset.sigma = model.sigma
where,
model.sigma
is a custom parameter defined as
model.register_parameter( name = 'sigma', param = torch.nn.Parameter( torch.tensor( 16, dtype = torch.float16), requires_grad = True))
after creating the model.
My problem is that model.sigma does not appear to be updated from epoch to epoch; specifically, it stays the same as the initial value. Why is this?
Looking at optimizer.state_dict(), I couldn't find any parameter named 'sigma', whereas I can find one in model.named_parameters().
Finally, this parameter sigma is not attached to any layer; it's kind of "free".
What you need to do is to set sigma as an attribute of the Dataset and change it between epochs.
For the dataset definition:
class UNetDataset(object):
    def __init__(self, ..., sigma=5):
        self.sigma = sigma
Now, within __getitem__, you can use the sigma value via self.sigma.
Within your training loop, after every epoch, you can change the sigma value by setting the sigma attribute of the Dataset:
for epoch in range(num_epochs):
    dataset.sigma = ...  # whatever value you want
    for i, (x, y) in enumerate(dataloader):
        ...
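To make the pattern concrete, here is a minimal end-to-end sketch (the toy dataset, shapes, and sigma schedule below are illustrative assumptions, not from the original answer):

import torch
from torch.utils.data import Dataset, DataLoader

class ToySigmaDataset(Dataset):
    def __init__(self, sigma=8.0):
        self.sigma = sigma

    def __len__(self):
        return 8

    def __getitem__(self, index):
        # self.sigma is read lazily here, so updating it between epochs takes effect
        x = torch.randn(1, 4, 4)
        y = torch.zeros(1, 4, 4)
        weight = torch.full_like(y, float(self.sigma))  # stand-in for the U-Net weight map
        return x, y, weight

dataset = ToySigmaDataset()
loader = DataLoader(dataset, batch_size=4)

for epoch in range(3):
    dataset.sigma = 8.0 + epoch  # e.g. copy the latest learned value of sigma here
    for x, y, weight in loader:
        pass  # training step would go here

Note that with the default persistent_workers=False the worker processes are re-created at the start of each epoch, so attribute updates made between epochs are picked up even when num_workers > 0.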

Pytorch Data Won't Fit in Memory - Example?

I am trying to find an example of training in PyTorch with batches of data read from disk, akin to the Keras fit_generator. How would I alter the code below to read the CSV from disk instead of loading it all into memory?
I have found that one can iterate over a custom Dataset like the one below, but I am unsure how to do this without loading all the data into memory.
I would like to:
Train and validate the model with data held on disk
Use mini-batches of the full data on disk
Repeat x epochs
import torch
import torch.utils.data as utils_data
from torch.utils.data import Dataset
from sklearn.datasets import load_boston  # removed from recent scikit-learn versions


class testLoader(Dataset):
    def __init__(self):
        # regular old numpy
        boston = load_boston()
        x = boston.data
        y = boston.target
        self.x = torch.from_numpy(x)
        self.y = torch.from_numpy(y)
        self.length = x.shape[0]
        self.vars = x.shape[1]

    def __getitem__(self, index):
        return self.x[index], self.y[index]

    def __len__(self):
        return self.length


training_samples = testLoader()
train_loader = utils_data.DataLoader(training_samples, batch_size=64, shuffle=True)
