Change image labels when using pytorch - pytorch

I am loading an image dataset with pytorch as seen below:
dataset = datasets.ImageFolder('...', transform=transform)
loader = DataLoader(dataset, batch_size=args.batchsize)
The dataset is in a folder with the structure shown below:
dataset/
    class_1/
    class_2/
    class_3/
As a result, each image in the class_1 folder gets label 0, each image in class_2 gets label 1, and so on.
However, I would like to change these labels and randomly assign a label to each image in the dataset. What I tried is:
new_labels = [random.randint(0, 3) for i in range(len(dataset.targets))]
dataset.targets = new_labels
However, this does not change the labels the way I wanted, and it leads to errors later during model training.
Is this the correct way to do it, or is there a more appropriate one?

You can have a transformation for the labels:
import random
class rand_label_transform(object):
    def __init__(self, num_labels):
        self.num_labels = num_labels
    def __call__(self, labels):
        # generate new random label (the original label is ignored)
        new_label = random.randint(0, self.num_labels - 1)
        return new_label
dataset = datasets.ImageFolder('...', transform=transform, target_transform=rand_label_transform(num_labels=3))
See ImageFolder for more details.
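Note that a target_transform is called every time a sample is fetched, so the transformation above draws a fresh random label on each access, and the same image can get different labels in different epochs. If one fixed random label per image is wanted instead, a possible sketch is to rewrite the (path, class_index) pairs that ImageFolder keeps in dataset.samples, because __getitem__ reads the label from there rather than from dataset.targets. The helper name relabel_samples, the seed, and the file names are illustrative assumptions, not part of torchvision:

```python
import random

def relabel_samples(samples, num_labels, seed=0):
    """Return (new_samples, new_targets) with one fixed random label per item.

    `samples` is a list of (path, class_index) pairs, the layout used by
    torchvision's ImageFolder.samples.
    """
    rng = random.Random(seed)  # local RNG so the result is reproducible
    new_targets = [rng.randrange(num_labels) for _ in samples]
    new_samples = [(path, t) for (path, _), t in zip(samples, new_targets)]
    return new_samples, new_targets

# Stand-in for dataset.samples (hypothetical file names)
samples = [("class_1/a.png", 0), ("class_2/b.png", 1), ("class_3/c.png", 2)]
new_samples, new_targets = relabel_samples(samples, num_labels=3, seed=42)
```

With a real ImageFolder, both attributes would be updated together: dataset.samples, dataset.targets = relabel_samples(dataset.samples, num_labels=3).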

Related

Tensorflow : How to retrieve parts of my multi input Dataset and their respective loss?

First of all, I am quite new to AI and TensorFlow.
My problem is the following: I need to train my neural network on pairs of images, one unchanged and the same one transformed. This implies a joint loss calculation over the paired images at the end, in order to compute the mutual information for an unsupervised image analysis problem.
Also, since my dataset consists of 4,000 RGB images of 256x256 pixels, I need to use a data generator.
Here is what I have done so far for my data generator:
class dataset(object):
    def __init__(self, data_list, batch_size):
        self.dataset = None
        self.batch_size = batch_size  # use the constructor argument rather than the global BATCH_SIZE
        self.current_batch = 0
        self.data_list = data_list
        self.normal_image = None
        self.transformed_image = None
        self.label = None
    def generator(self):
        index = self.current_batch * self.batch_size
        self.current_batch = self.current_batch + 1
        for image, label in self.data_list[index:]:
            self.label = label
            image = image / 255.0
            self.normal_image = image
            self.transformed_image = utils.get_random_crop(image, height = 200, width = 200)
            yield ({'normal_image' : self.normal_image,
                    'transformed_image' : self.transformed_image},
                   {'label' : self.label})
    def data_loader(self):
        self.dataset = tf.data.Dataset.from_generator(self.generator,
            output_types=(
                {'normal_image' : tf.float32,
                 'transformed_image' : tf.float32},
                {'label' : tf.int32})).batch(self.batch_size)
        return self.dataset
train_dataset = dataset(train_list, BATCH_SIZE)
test_dataset = dataset(test_list, BATCH_SIZE)
Note that train_list and test_list are just raw numpy arrays that I retrieved from my image collection.
Here are my 2 questions:
How can I retrieve the loss for my normal and transformed images separately, so that I can do a joint loss calculation at the end of each epoch?
My data generator seems to work fine: each next() retrieves the next batch of my collection. However, as you can see, I have a (kind of?) tuple inside my dataset: {normal_image, transformed_image}.
I am having a hard time finding out how to access one specific element of this (kind of?) tuple, in order to feed my CNN with the normal_image and the transformed_image one at a time, etc.
dataset.transformed_image would have been too good, haha!
Also, in my dataset class I have self.normal_image and self.transformed_image, but I use them only for plotting. They are not tensors, unlike the ones in my dataset :(
Thanks for your time!

What am I missing here, using ImageFolder to get the full folder name as labels for MNIST-double dataset images?

I would like to use datasets.ImageFolder to create an image dataset.
My current image directory structure looks like this:
In the train directory, I have subfolders named after my labels: 00, 01, and so on. Each folder contains images of the double digits corresponding to its label.
Here is the code I used, followed by the output, where the labels do not match the images.
Paths:
data_dir = "/home/mhamdan/hamdan/MNIST_muldigits/data/double_mnist"
train_dir = data_dir + '/train' # training_set contains training dataset
val_dir = data_dir + '/val' #contains validation dataset
test_dir = data_dir + '/test' #contains test dataset
Loading the data here
#Load the dataset with Image Folder
trainset = datasets.ImageFolder(train_dir, transform = transformation)
valset = datasets.ImageFolder(val_dir, transform = transformation)
testset = datasets.ImageFolder(test_dir, transform = transformation)
Data loaders
#define data loaders
batch_size = 32
train_loader = DataLoader(trainset, batch_size=batch_size, shuffle=True,num_workers=2)
val_loader = DataLoader(valset, batch_size=batch_size, shuffle=True,num_workers=2)
test_loader = DataLoader(testset, batch_size=batch_size,num_workers=1)
Here is the plotting of random training images
examples = enumerate(train_loader)
batch_idx, (example_data, example_targets) = next(examples)
import matplotlib.pyplot as plt
fig = plt.figure()
for i in range(6):
    plt.subplot(2,3,i+1)
    plt.tight_layout()
    plt.imshow(example_data[i][0], cmap='gray', interpolation='none')
    plt.title("Ground Truth: {}".format(example_targets[1]))
    plt.xticks([])
    plt.yticks([])
fig
As you can see, the labels do not match the images.
(image: labels differ from the images)
Each subfolder corresponds to a unique label.
(image: the images in the 01 subdirectory)
Last update after using the index.
I think the problem is in printing the labels.
For the plotting of random training images, the code should be:
examples = enumerate(train_loader)
batch_idx, (example_data, example_targets) = next(examples)
import matplotlib.pyplot as plt
fig = plt.figure()
for i in range(6):
    plt.subplot(2,3,i+1)
    plt.tight_layout()
    plt.imshow(example_data[i][0], cmap='gray', interpolation='none')
    plt.title("Ground Truth: {}".format(example_targets[i]))
    plt.xticks([])
    plt.yticks([])
fig
In your code, it was example_targets[1] instead of example_targets[i], so every subplot showed the label of the second image in the batch.
Here is the solution to my question: take the class-to-index dictionary, labelsdec = trainset.class_to_idx, and extract its keys as the labels/classes using this function:
def getList(d):
    # collect the class names (the keys of class_to_idx) in insertion order
    keys = []
    for key in d.keys():
        keys.append(key)
    return keys
classes = getList(labelsdec)
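A slightly more explicit variant, as a sketch: rather than relying on the key order of class_to_idx, invert the dictionary so that each integer label maps directly to its folder name. The class_to_idx value here is a hypothetical stand-in for trainset.class_to_idx, and label_name is an illustrative helper name; ImageFolder also exposes the folder names as trainset.classes.

```python
# Hypothetical stand-in for trainset.class_to_idx (folder name -> integer label)
class_to_idx = {"00": 0, "01": 1, "02": 2, "10": 3}

# Invert the mapping: integer label -> folder name
idx_to_class = {idx: name for name, idx in class_to_idx.items()}

def label_name(label):
    """Map an integer label (a plain int or a 0-d tensor) to its folder name."""
    return idx_to_class[int(label)]

print(label_name(3))  # prints "10"
```

The int() call is there so the same helper accepts the 0-d tensors that come out of a DataLoader batch.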
Then plotting 10 images:
def imshow(img):
    img = img / 2 + 0.5 # unnormalize
    plt.imshow(np.transpose(img, (1, 2, 0))) # convert from Tensor image
# obtain one batch of training images
data_iter = iter(train_loader)
images, lbls = next(data_iter) # use the built-in next(); the iterator may not have a .next() method
images = images.numpy() # convert images to numpy for display
# plot the images in the batch, along with the corresponding labels
fig = plt.figure(figsize=(10, 4))
# display 10 images
for idx in np.arange(10):
    ax = fig.add_subplot(2, 10 // 2, idx+1, xticks=[], yticks=[]) # integer division: the grid size must be an int
    imshow(images[idx])
    ax.set_title(classes[lbls[idx]])
Here is how it looks (see image).

Applying a simple transformation to get a binary image using pytorch

I'd like to binarize an image before passing it to the dataloader. I have created a dataset class which works well, but in the __getitem__() method I'd like to threshold the image:
def __getitem__(self, idx):
    # Open image, apply transforms and return with label
    img_path = os.path.join(self.dir, self.filelist[filename"])
    image = Image.open(img_path)
    label = self.x_data.iloc[idx]["label"]
    # Applying transformation to the image
    if self.transforms is not None:
        image = self.transforms(image)
    # applying threshold here:
    my_threshold = 240
    image = image.point(lambda p: p < my_threshold and 255)
    image = torch.tensor(image)
    return image, label
And then I tried to invoke the dataset:
data_transformer = transforms.Compose([
    transforms.Resize((10, 10)),
    transforms.Grayscale(),
    # transforms.ToTensor()
])
train_set = MyNewDataset(data_path, data_transformer, rows_train)
Since I applied the threshold to a PIL object, I need to convert it to a tensor afterwards, but for some reason it crashes. Can somebody please assist me?
Why not apply the binarization after the conversion from PIL.Image to torch.Tensor?
class ThresholdTransform(object):
    def __init__(self, thr_255):
        self.thr = thr_255 / 255. # input threshold for [0..255] gray level, convert to [0..1]
    def __call__(self, x):
        return (x > self.thr).to(x.dtype) # do not change the data type
Once you have this transformation, you simply add it:
data_transformer = transforms.Compose([
    transforms.Resize((10, 10)),
    transforms.Grayscale(),
    transforms.ToTensor(),
    ThresholdTransform(thr_255=240)
])

Torchtext 0.7 shows Field is being deprecated. What is the alternative?

Looks like the previous paradigm of declaring Fields and Examples and using BucketIterator is deprecated and will move to legacy in 0.8. However, I don't seem to be able to find an example of the new paradigm for custom datasets (as in, not the ones included in torchtext.datasets) that doesn't use Field. Can anyone point me at an up-to-date example?
Reference for deprecation:
https://github.com/pytorch/text/releases
It took me a little while to find the solution myself. The new paradigm is like so for prebuilt datasets:
from torchtext.experimental.datasets import AG_NEWS
train, test = AG_NEWS(ngrams=3)
or like so for custom built datasets:
from torch.utils.data import DataLoader
def collate_fn(batch):
    texts, labels = [], []
    for label, txt in batch:
        texts.append(txt)
        labels.append(label)
    return texts, labels
dataloader = DataLoader(train, batch_size=8, collate_fn=collate_fn)
for idx, (texts, labels) in enumerate(dataloader):
    print(idx, texts, labels)
I've copied the examples from the Source
Browsing through torchtext's GitHub repo I stumbled over the README in the legacy directory, which is not documented in the official docs. The README links a GitHub issue that explains the rationale behind the change as well as a migration guide.
If you just want to keep your existing code running with torchtext 0.9.0, where the deprecated classes have been moved to the legacy module, you have to adjust your imports:
# from torchtext.data import Field, TabularDataset
from torchtext.legacy.data import Field, TabularDataset
Alternatively, you can import the whole torchtext.legacy module as torchtext as suggested by the README:
import torchtext.legacy as torchtext
There is a post regarding this. Instead of the deprecated Field and BucketIterator classes, it uses the TextClassificationDataset along with the collator and other preprocessing. It reads a txt file and builds a dataset, followed by a model. Inside the post, there is a link to a complete working notebook. The post is at: https://mmg10.github.io/pytorch/2021/02/16/text_torch.html. But you need the 'dev' (or nightly build) of PyTorch for it to work.
From the link above:
After tokenization and building vocabulary, you can build the dataset as follows
def data_to_dataset(data, tokenizer, vocab):
    data = [(text, label) for (text, label) in data]
    text_transform = sequential_transforms(tokenizer.tokenize,
                                           vocab_func(vocab),
                                           totensor(dtype=torch.long)
                                           )
    label_transform = sequential_transforms(lambda x: 1 if x =='1' else (0 if x =='0' else x),
                                            totensor(dtype=torch.long)
                                            )
    transforms = (text_transform, label_transform)
    dataset = TextClassificationDataset(data, vocab, transforms)
    return dataset
The collator is as follows:
class Collator:  # class header is missing from the excerpt; the name Collator is assumed
    def __init__(self, pad_idx):
        self.pad_idx = pad_idx
    def collate(self, batch):
        text, labels = zip(*batch)
        labels = torch.LongTensor(labels)
        text = nn.utils.rnn.pad_sequence(text, padding_value=self.pad_idx, batch_first=True)
        return text, labels
Then, you can build the dataloader with the typical torch.utils.data.DataLoader using the collate_fn argument.
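To make the padding step concrete, here is a dependency-free sketch of what such a collator does to a batch of variable-length token-id sequences. It mirrors the effect of pad_sequence with batch_first=True, but on plain lists rather than tensors; the function name pad_batch and the sample ids are made up for illustration.

```python
def pad_batch(batch, pad_idx):
    """Right-pad each token-id list in the batch to the longest length.

    `batch` is a list of (token_ids, label) pairs, as a Dataset would yield.
    """
    texts, labels = zip(*batch)
    max_len = max(len(t) for t in texts)
    padded = [list(t) + [pad_idx] * (max_len - len(t)) for t in texts]
    return padded, list(labels)

batch = [([5, 8, 2], 1), ([7], 0), ([4, 9], 1)]
padded, labels = pad_batch(batch, pad_idx=0)
print(padded)  # [[5, 8, 2], [7, 0, 0], [4, 9, 0]]
print(labels)  # [1, 0, 1]
```

Padding to the longest sequence in each batch, rather than to a global maximum, keeps batches of short texts small.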
It seems the pipeline could look like this:
import torchtext as TT
import torch
from collections import Counter
from torch.utils.data import DataLoader
from torchtext.vocab import Vocab
# read the data
with open('text_data.txt','r') as f:
    data = f.readlines()
with open('labels.txt', 'r') as f:
    labels = f.readlines()
tokenizer = TT.data.utils.get_tokenizer('spacy', 'en') # can remove 'spacy' and use a simple built-in tokenizer
train_iter = zip(labels, data)
counter = Counter()
for (label, line) in train_iter:
    counter.update(tokenizer(line))
vocab = TT.vocab.Vocab(counter, min_freq=1)
text_pipeline = lambda x: [vocab[token] for token in tokenizer(x)]
# this is data-specific - adapt for your data
label_pipeline = lambda x: 1 if x == 'positive\n' else 0
class TextData(torch.utils.data.Dataset):
    '''
    very basic dataset for processing text data
    '''
    def __init__(self, labels, text):
        super(TextData, self).__init__()
        self.labels = labels
        self.text = text
    def __getitem__(self, index):
        return self.labels[index], self.text[index]
    def __len__(self):
        return len(self.labels)
def tokenize_batch(batch, max_len=200):
    '''
    collate function to use in DataLoader
    takes a batch from the text dataset and produces a tensor batch, converting text and labels through tokenizer, labeler
    tokenizer is a global function text_pipeline
    labeler is a global function label_pipeline
    max_len is a fixed length; if a text is shorter than max_len it is left-padded with ones (pad number)
    if a text is longer than max_len it is truncated, keeping the end of the string
    '''
    labels_list, text_list = [], []
    for _label, _text in batch:
        labels_list.append(label_pipeline(_label))
        text_holder = torch.ones(max_len, dtype=torch.int32) # fixed size tensor of max_len
        processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int32)
        pos = min(max_len, len(processed_text)) # use max_len rather than a hard-coded 200
        text_holder[-pos:] = processed_text[-pos:]
        text_list.append(text_holder.unsqueeze(dim=0))
    return torch.FloatTensor(labels_list), torch.cat(text_list, dim=0)
train_dataset = TextData(labels, data)
train_loader = DataLoader(train_dataset, batch_size=2, shuffle=False, collate_fn=tokenize_batch)
lbl, txt = next(iter(train_loader))

How to add new sample to CIFAR10 torchvision?

Hi, I want to add my own images to the CIFAR10 dataset in torchvision. How can I do that?
train_data = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=train_transform)
train_data.add # or a workaround!
thanks
You can either create a custom dataset for CIFAR10 using the raw cifar10 images here, or you can still use the CIFAR10 dataset inside your new custom dataset and then add your logic in the __getitem__() method.
This is a simple example to get you going:
class CIFAR10_2(torch.utils.data.Dataset):
    def __init__(self, dataset_path='/cifar10', transformations=None, should_download=True):
        self.dataset_train = torchvision.datasets.CIFAR10(dataset_path, download=should_download)
        self.transformations = transformations
    def __getitem__(self, index):
        # do as you wish, add your logic here
        (img, label) = self.dataset_train[index]
        # for transformations for example
        if self.transformations is not None:
            return self.transformations(img), label
        return img, label
    def __len__(self):
        return len(self.dataset_train)
You can get fancy and add logic for test, validation, etc., and do whatever you like.
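To actually append your own samples, one possible sketch is a wrapper that serves indices below len(base) from the original dataset and the remaining indices from an extra list of (image, label) pairs. The class name ExtendedDataset is hypothetical, and the example uses plain lists of strings as stand-ins so it runs without downloading CIFAR10. With real data the wrapper could subclass torch.utils.data.Dataset, and torch.utils.data.ConcatDataset provides similar concatenation out of the box.

```python
class ExtendedDataset:
    """Serve items from a base dataset followed by extra (image, label) pairs."""

    def __init__(self, base, extra, transform=None):
        self.base = base            # e.g. a CIFAR10 dataset built without a transform
        self.extra = list(extra)    # your own (image, label) pairs
        self.transform = transform  # applied to base and extra images alike

    def __len__(self):
        return len(self.base) + len(self.extra)

    def __getitem__(self, index):
        if index < 0 or index >= len(self):
            raise IndexError(index)
        if index < len(self.base):
            img, label = self.base[index]
        else:
            img, label = self.extra[index - len(self.base)]
        if self.transform is not None:
            img = self.transform(img)
        return img, label

# Stand-ins: strings play the role of images (hypothetical data)
base = [("cifar_img_0", 3), ("cifar_img_1", 5)]
extra = [("my_img_0", 9)]
data = ExtendedDataset(base, extra, transform=str.upper)
print(len(data))  # 3
print(data[2])    # ('MY_IMG_0', 9)
```

Applying the transform in the wrapper, and leaving the base dataset without one, keeps the original and the added images on the same preprocessing path.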
