I am working on an image dataset where images are classified into 10 classes (the CIFAR10 dataset). I am using PyTorch. I would like to know how to determine the number of images per class by looping through the dataset. Thanks in advance for your response.
You can do it in two steps.
First, create a dictionary img_dict with one entry for each class of the CIFAR10 dataset, with all values initialized to 0.
Then loop through the dataset and increment the value for the corresponding class key in img_dict:
from torchvision.datasets import CIFAR10
from torchvision.transforms import ToTensor

dataset = CIFAR10(root='data/', download=True, transform=ToTensor())
dataset_size = len(dataset)
classes = dataset.classes
num_classes = len(dataset.classes)

img_dict = {}
for i in range(num_classes):
    img_dict[classes[i]] = 0
for i in range(dataset_size):
    img, label = dataset[i]
    img_dict[classes[label]] += 1

print(img_dict)
The output is the number of images per class; for the CIFAR10 training set this is 5,000 images per class.
Here's a faster way to do it:
import os
from collections import Counter

import torchvision
from torchvision.transforms import ToTensor

dataset = torchvision.datasets.CIFAR10(root=os.getcwd(), download=True, transform=ToTensor())
idx_to_class = {v: k for k, v in dataset.class_to_idx.items()}  # index-to-class mapping
idx_distribution = dict(Counter(dataset.targets))
class_distribution = {idx_to_class[k]: v for k, v in idx_distribution.items()}
print(class_distribution)
Result:
{
'frog': 5000,
'truck': 5000,
'deer': 5000,
'automobile': 5000,
'bird': 5000,
'horse': 5000,
'ship': 5000,
'cat': 5000,
'dog': 5000,
'airplane': 5000
}
Colab: https://colab.research.google.com/drive/11hHy0YNwgEhE5nmQqXRliE4KkIzqjBeN?usp=sharing
First of all, I am quite new to how AI and TensorFlow work.
My problem is the following: I need to train my neural network on pairs of images, one that is unchanged and the same one transformed. At the end this requires a joint loss calculation over the paired images, in order to compute the mutual information for an unsupervised image analysis problem.
Also, since my dataset consists of 4,000 RGB images of size 256*256, I need to use a data generator.
Here is an example of what I have already done for my data generator:
class dataset(object):

    def __init__(self, data_list, batch_size):
        self.dataset = None
        self.batch_size = batch_size
        self.current_batch = 0
        self.data_list = data_list
        self.normal_image = None
        self.transformed_image = None
        self.label = None

    def generator(self):
        index = self.current_batch * self.batch_size
        self.current_batch = self.current_batch + 1
        for image, label in self.data_list[index:]:
            self.label = label
            image = image / 255.0
            self.normal_image = image
            self.transformed_image = utils.get_random_crop(image, height=200, width=200)
            yield ({'normal_image': self.normal_image,
                    'transformed_image': self.transformed_image},
                   {'label': self.label})

    def data_loader(self):
        self.dataset = tf.data.Dataset.from_generator(
            self.generator,
            output_types=({'normal_image': tf.float32,
                           'transformed_image': tf.float32},
                          {'label': tf.int32})).batch(self.batch_size)
        return self.dataset
train_dataset = dataset(train_list, BATCH_SIZE)
test_dataset = dataset(test_list, BATCH_SIZE)
Note that train_list and test_list are just raw NumPy arrays that I have retrieved from my image collection.
Here are my two questions:
How can I retrieve specifically the losses of my normal and transformed images, so that I can do a joint loss calculation at the end of each epoch?
My data generator seems to work fine; each next() retrieves the next batch of my collection. However, as you can see, I have a (kind of?) tuple inside my dataset: {normal_image, transformed_image}.
I am having a hard time finding out how to access one of those entries specifically inside this (kind of?) tuple, in order to feed my CNN with the normal_image and the transformed_image one at a time, etc.
dataset.transformed_image would have been too good, haha!
Also, in my dataset class I have self.normal_image and self.transformed_image, but I use them only for plotting. They are not tensors like in my dataset. :(
Thanks for your time!
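Regarding the second question, here is a minimal sketch of how the dict inside each yielded element could be accessed by key. This assumes the data_loader above and TF 2.x eager execution; it is only an illustration, not part of the original post:

import tensorflow as tf

# Build the tf.data.Dataset from the generator defined above.
train_tf_dataset = train_dataset.data_loader()

# Each element is a (features, labels) pair; both are dicts,
# so the individual tensors can be looked up by key.
for features, labels in train_tf_dataset.take(1):
    normal_batch = features['normal_image']            # shape: (batch, 256, 256, 3)
    transformed_batch = features['transformed_image']  # shape: (batch, 200, 200, 3)
    label_batch = labels['label']
    print(normal_batch.shape, transformed_batch.shape, label_batch.shape)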
I am working on the STL-10 image dataset, which consists of 10 different classes. I want to reduce this multiclass image classification problem to a binary classification problem, such as class 1 vs. the rest. I am using PyTorch torchvision to download and use the STL-10 data, but I am unable to set it up as one-vs-rest.
train_data = torchvision.datasets.STL10(root='data', split='train', transform=data_transforms['train'], download=True)
test_data = torchvision.datasets.STL10(root='data', split='test', transform=data_transforms['val'], download=True)

train_dataloader = DataLoader(train_data, batch_size=64, shuffle=True, num_workers=2)
test_dataloader = DataLoader(test_data, batch_size=64, shuffle=True, num_workers=2)
For torchvision datasets there is a built-in way to do this. You need to define a transformation function or class and pass it as the target_transform argument when creating the dataset.
torchvision.datasets.STL10(root: str, split: str = 'train', folds: Union[int, NoneType] = None, transform: Union[Callable, NoneType] = None, target_transform: Union[Callable, NoneType] = None, download: bool = False)
Here is a working example for reference:
import torchvision
from torch.utils.data import DataLoader
from torchvision import transforms


class Multi2UniLabelTfm():
    def __init__(self, pos_label=5):
        if isinstance(pos_label, int) or isinstance(pos_label, float):
            pos_label = [pos_label, ]
        self.pos_label = pos_label

    def __call__(self, y):
        # if y == self.pos_label:
        if y in self.pos_label:
            return 1
        else:
            return 0


if __name__ == '__main__':
    test_tfms = transforms.Compose([
        transforms.ToTensor()
    ])
    data_transforms = {'val': test_tfms}

    # Original labels
    # target_transform = None

    # Label 5 is converted to 1. Rest are 0.
    # target_transform = Multi2UniLabelTfm(pos_label=5)

    # Labels 5, 6, 7 are converted to 1. Rest are 0.
    target_transform = Multi2UniLabelTfm(pos_label=[5, 6, 7])

    test_data = torchvision.datasets.STL10(root='data', split='test', transform=data_transforms['val'], download=True, target_transform=target_transform)
    test_dataloader = DataLoader(test_data, batch_size=64, shuffle=True, num_workers=2)

    for idx, (x, y) in enumerate(test_dataloader):
        print(idx, y)
        if idx == 5:
            break
You need to relabel the images. Initially, the first class corresponds to label 0, the second class to label 1, ..., and the tenth class to label 9. If you want binary classification, you need to change the label of the images of category 1 (or another chosen category) to 0, and the label of the images of all other categories to 1.
One way is to update the label values at runtime, before passing them to the loss function in the training loop. Let's say we want to relabel class 5 as 1 and the rest as 0:
my_class_id = 5
for imgs, labels in train_dataloader:
    labels = torch.where(labels == my_class_id, 1, 0)
    ...
You may also need to do similar relabeling for test_dataloader. Also, I am not sure about the datatype of the labels; if the loss function expects floats, change it accordingly.
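For instance, if a binary loss such as nn.BCEWithLogitsLoss is used, it expects float targets. A minimal sketch (model and criterion here are placeholders and not part of the original question) might look like this:

import torch
import torch.nn as nn

criterion = nn.BCEWithLogitsLoss()  # binary loss expecting float targets
my_class_id = 5

for imgs, labels in train_dataloader:
    # Relabel: 1 for the positive class, 0 for everything else, then cast to float.
    binary_labels = torch.where(labels == my_class_id, 1, 0).float()
    logits = model(imgs).squeeze(1)  # assumes the model outputs one logit per sample
    loss = criterion(logits, binary_labels)
    ...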
In PyTorch, is there any way of loading a specific single sample using the torch.utils.data.DataLoader class? I'd like to do some testing with it.
The tutorial uses
trainloader = torch.utils.data.DataLoader(...)
images, labels = next(iter(trainloader))
to fetch a random batch of samples. Is there a way, using DataLoader, to get a specific sample?
Cheers
Turn off shuffling in the DataLoader.
Use batch_size to calculate the batch in which the desired sample falls.
Iterate to that batch.
Code
import torch
import numpy as np
import itertools

X = np.arange(100)
batch_size = 2

dataloader = torch.utils.data.DataLoader(X, batch_size=batch_size, shuffle=False)

sample_at = 5
k = int(np.floor(sample_at / batch_size))

my_sample = next(itertools.islice(dataloader, k, None))
print(my_sample)
Output:
tensor([4, 5])
If you want to get a specific single sample from your dataset, you should check the Subset class (https://pytorch.org/docs/stable/data.html#torch.utils.data.Subset).
Something like this:
indices = [0, 1, 2]  # select your indices here as a list
subset = torch.utils.data.Subset(train_set, indices)
trainloader = DataLoader(subset, batch_size=16, shuffle=False)  # set shuffle to False

for image, label in trainloader:
    print(image.size(), '\t', label.size())
    print(image[0], '\t', label[0])  # index the specific sample
Here is a useful link if you want to learn more about the PyTorch data loading utilities:
https://pytorch.org/docs/stable/data.html
I do binary text classification with BERT from the Simple Transformers library.
I work in Colab with a GPU runtime type.
I have generated train and test sets with the sklearn StratifiedKFold method. I have two files with the dictionaries containing my folds.
I run my classification in the following while loop:
import shutil

import pandas as pd
import sklearn
from sklearn.metrics import matthews_corrcoef, f1_score
from simpletransformers.classification import ClassificationModel

counter = 0
resultatos = []

while counter != len(trainfolds):

    model = ClassificationModel('bert', 'bert-base-multilingual-cased',
                                args={'num_train_epochs': 4, 'learning_rate': 1e-5, 'fp16': False,
                                      'max_seq_length': 160, 'train_batch_size': 24, 'eval_batch_size': 24,
                                      'warmup_ratio': 0.0, 'weight_decay': 0.00,
                                      'overwrite_output_dir': True})

    print("start with fold_{}".format(counter))

    trainfolds["{}_fold".format(counter)].to_csv("/content/data/train.tsv", sep="\t", index=False, header=False)
    print("{}_fold train exported as train.tsv".format(counter))

    testfolds["{}_fold".format(counter)].to_csv("/content/data/dev.tsv", sep="\t", index=False, header=False)
    print("{}_fold test exported as dev.tsv".format(counter))

    train_df = pd.read_csv("/content/data/train.tsv", delimiter='\t', header=None)
    eval_df = pd.read_csv("/content/data/dev.tsv", delimiter='\t', header=None)

    train_df = pd.DataFrame({
        'text': train_df[3].replace(r'\n', ' ', regex=True),
        'label': train_df[1]})

    eval_df = pd.DataFrame({
        'text': eval_df[3].replace(r'\n', ' ', regex=True),
        'label': eval_df[1]})

    model.train_model(train_df)
    result, model_outputs, wrong_predictions = model.eval_model(eval_df, f1=sklearn.metrics.f1_score)

    print(result)
    resultatos.append(result)

    shutil.rmtree("outputs")
    shutil.rmtree("cache_dir")
    # shutil.rmtree("runs")

    counter += 1
And I get different results running this code for the same folds.
Here, for example, are the F1 scores for two runs:
0.6237942122186495
0.6189111747851003
0.6172839506172839
0.632183908045977
0.6182965299684542
0.5942492012779553
0.6025641025641025
0.6153846153846154
0.6390532544378699
0.6627906976744187
The F1 Score is: 0.6224511646974427
0.6064516129032258
0.6282420749279539
0.6402439024390244
0.5971014492753622
0.6135693215339232
0.6191950464396285
0.6382978723404256
0.6388059701492537
0.6097560975609756
0.5956112852664576
The F1 Score is: 0.618727463283623
How can they be that different for the same folds?
What I already tried is setting a fixed random seed right before my loop starts:
random.seed(42)
np.random.seed(42)
torch.manual_seed(42)
torch.cuda.manual_seed_all(42)
I came up with the approach of initializing the model inside the loop because, when it is outside the loop, it somehow remembers what it has learned; that means after the second fold I get an F1 score of almost one, despite the fact that I delete the cache.
I figured it out myself: just set all seeds plus
torch.backends.cudnn.deterministic = True
and
torch.backends.cudnn.benchmark = False
as shown in this post, and I get the same results for all runs!
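For reference, a minimal sketch of such a seeding helper (the function name set_seed is only an illustration), to be called once before the cross-validation loop:

import random

import numpy as np
import torch


def set_seed(seed=42):
    # Seed every source of randomness used by Python, NumPy and PyTorch.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Make cuDNN deterministic (can be slower, but gives reproducible results).
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


set_seed(42)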
I am trying to load the IMDB dataset in Python. I want to pad the sequences so that each sequence is of the same length. I am currently doing it with NumPy. What is a good way to do it in TensorFlow with tf.pad? I saw the example given here, but I don't know how to apply it to a 2-D matrix.
Here is my current code
import numpy as np
import tensorflow as tf
from keras.datasets import imdb

max_features = 5000
print('Loading data...')
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=max_features)


def padSequence(dataset, max_length):
    dataset_p = []
    for x in dataset:
        if len(x) <= max_length:
            dataset_p.append(np.pad(x, pad_width=(0, max_length - len(x)), mode='constant', constant_values=0))
        else:
            dataset_p.append(x[0:max_length])
    return np.array(dataset_p)


max_length = max(len(x) for x in x_train)
x_train_p = padSequence(x_train, max_length)
x_test_p = padSequence(x_test, max_length)
print("input x shape: ", x_train_p.shape)
Can someone please help?
I am using TensorFlow 1.0.
In response to the comment:
The padding dimensions are given by
# 'paddings' is [[1, 1], [2, 2]].
I have a 2-D matrix where every row has a different length. I want to be able to pad them all to equal length. In my padSequence(dataset, max_length) function, I get the length of every row with the len(x) function. Should I just do the same with tf, or is there a way to do it like the Keras function
x_train = keras.preprocessing.sequence.pad_sequences(x_train, maxlen=maxlen)
If you want to use tf.pad, in my opinion you have to iterate over each row.
The code will be something like this:
import numpy as np
import tensorflow as tf

max_length = 250
number_of_samples = 5

padded_data = np.ndarray(shape=[number_of_samples, max_length], dtype=np.int32)

sess = tf.InteractiveSession()
for i in range(number_of_samples):
    reviewToBePadded = dataSet[i]  # dataSet is a numpy array of reviews
    paddings = [[0, 0], [0, max_length - len(reviewToBePadded)]]
    data_tf = tf.convert_to_tensor(reviewToBePadded, tf.int32)
    data_tf = tf.reshape(data_tf, [1, len(reviewToBePadded)])
    data_tf = tf.pad(data_tf, paddings, 'CONSTANT')
    padded_data[i] = data_tf.eval()

print(padded_data)
sess.close()
I am new to Python, so this is possibly not the best code, but I just want to explain the concept.
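For comparison, the Keras helper mentioned in the question can achieve the same result in a single call. A minimal sketch (padding='post' and truncating='post' pad and cut at the end, matching the tf.pad approach above):

from keras.preprocessing.sequence import pad_sequences

max_length = 250
# Pads shorter reviews with zeros at the end and truncates longer ones,
# returning an array of shape (num_samples, max_length).
x_train_p = pad_sequences(x_train, maxlen=max_length, padding='post', truncating='post', value=0)
x_test_p = pad_sequences(x_test, maxlen=max_length, padding='post', truncating='post', value=0)
print(x_train_p.shape)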