parallelising tf.data.Dataset.from_generator with TF2.1

There are already two posts about this topic, but they have not been updated for the recent TF 2.1 release...
In brief, I've got a lot of tif images to read and parse with a specific pipeline.
import functools
import tensorflow as tf
import numpy as np

files = # a list of str
labels = # a list of int
n_unique_label = len(np.unique(labels))

gen = functools.partial(generator, file_list=files, label_list=labels, param1=x1, param2=x2)
dataset = tf.data.Dataset.from_generator(gen, output_types=(tf.float32, tf.int32))
dataset = dataset.map(lambda b, c: (b, tf.one_hot(c, depth=n_unique_label)))
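The generator itself is elided in the post; a hypothetical sketch consistent with the from_generator call above (and assuming the tif_parser shown further down) would be:

def generator(file_list, label_list, param1, param2):
    # yields (image, label) pairs matching output_types=(tf.float32, tf.int32)
    for file_path, label in zip(file_list, label_list):
        yield tif_parser(file_path, param1=param1, param2=param2), label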
This pipeline works well. Nevertheless, I need to parallelize the file-parsing part, so I tried the following solution:
files = # a list of str
files = tf.data.Dataset.from_tensor_slices(files)

def wrapper(file_path):
    parser = functools.partial(tif_parser, param1=x1, param2=x2)
    return tf.py_function(parser, inp=[file_path], Tout=[tf.float32])

dataset = files.map(wrapper, num_parallel_calls=2)
The difference is that I parse one file at a time here with the parser function. However, it does not work:
File "loader.py", line 643, in tif_parser
image = numpy.array(Image.open(file_path)).astype(float)
File "python3.7/site-packages/PIL/Image.py", line 2815, in open
fp = io.BytesIO(fp.read())
AttributeError: 'tensorflow.python.framework.ops.EagerTensor' object has no attribute 'read'
[[{{node EagerPyFunc}}]] [Op:IteratorGetNextSync]
As far as I understand, the tif_parser function does not receive a string but an (unevaluated) tensor. For now, this function is fairly simple:
def tif_parser(file_path, param1=1, param2=2):
    image = numpy.array(Image.open(file_path)).astype(float)
    image /= 255.0
    return image
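A common fix for this error (a sketch, not from the original post): tf.py_function hands the parser an EagerTensor, whose payload is reachable via .numpy(), so the path string can be recovered inside the parser itself:

def tif_parser(file_path, param1=1, param2=2):
    # tf.py_function passes an EagerTensor; recover the path string first
    file_path = file_path.numpy().decode("utf-8")
    image = numpy.array(Image.open(file_path)).astype(float)
    image /= 255.0
    return image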

Here is how I proceeded:
dataset = tf.data.Dataset.from_tensor_slices((files, labels))

def wrapper(file_path, label):
    import functools
    parser = functools.partial(tif_parser, param1=x1, param2=x2)
    return tf.data.Dataset.from_generator(parser, (tf.float32, tf.int32), args=(file_path, label))

dataset = dataset.interleave(wrapper, cycle_length=tf.data.experimental.AUTOTUNE)
# The labels are converted to 1-hot vectors; could be integrated in tif_parser
dataset = dataset.map(lambda i, l: (i, tf.one_hot(l, depth=unique_label_count)))
dataset = dataset.shuffle(buffer_size=file_count, reshuffle_each_iteration=True)
dataset = dataset.batch(batch_size=batch_size, drop_remainder=False)
dataset = dataset.prefetch(tf.data.experimental.AUTOTUNE)
Concretely, I generate a dataset every time the parser is called. The parser is run cycle_length times at each call, meaning that cycle_length images are read at once. This suits my specific case, because I cannot load all the images in memory. I am unsure whether the prefetch is used correctly here.
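For what it's worth, a hedged variant (my sketch, assuming TF 2.x interleave semantics): interleave also accepts num_parallel_calls, which makes the parallelism explicit rather than relying on cycle_length alone, while prefetch(AUTOTUNE) as the last stage simply overlaps preprocessing with consumption and looks correctly placed.

dataset = dataset.interleave(
    wrapper,
    cycle_length=4,  # how many files are open concurrently
    num_parallel_calls=tf.data.experimental.AUTOTUNE)  # parallelize across them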


Huggingface Transformers NER - Offset Mapping Causing ValueError in NumPy boolean array indexing assignment

I was trying out the NER tutorial Token Classification with W-NUT Emerging Entities (https://huggingface.co/transformers/custom_datasets.html#tok-ner) in google colab using the Annotated Corpus for Named Entity Recognition data on Kaggle (https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus?select=ner_dataset.csv).
I will outline my process in detail to facilitate an understanding of what I was doing and to let the community help me figure out the source of the indexing assignment error.
To load the data from google drive where I have saved it, I used the following code
# import pandas library
import pandas as pd
# columns to select
cols_to_select = ["Sentence #", "Word", "Tag"]
# google drive data path
data_path = '/content/drive/MyDrive/Colab Notebooks/ner/ner_dataset.csv'
# load the data from google drive
dataset = pd.read_csv(data_path, encoding="latin-1")[cols_to_select].fillna(method='ffill')
I ran the following code to parse the sentences and tags:
class SentenceGetter(object):
    def __init__(self, data):
        self.n_sent = 1
        self.data = data
        self.empty = False
        agg_func = lambda s: [(w, t) for w, t in zip(s["Word"].values.tolist(),
                                                     s["Tag"].values.tolist())]
        self.grouped = self.data.groupby("Sentence #").apply(agg_func)
        self.sentences = [s for s in self.grouped]

    def retrieve(self):
        try:
            s = self.grouped["Sentence: {}".format(self.n_sent)]
            self.n_sent += 1
            return s
        except:
            return None
# get full data
getter = SentenceGetter(dataset)
# get sentences
sentences = [[s[0] for s in sent] for sent in getter.sentences]
# get tags/labels
tags = [[s[1] for s in sent] for sent in getter.sentences]
# take a look at the data
print(sentences[0][0:5], tags[0][0:5], sep='\n')
I then split the data into train, val, and test sets
# import the sklearn module
from sklearn.model_selection import train_test_split
# split data into temp and test sets
temp_texts, test_texts, temp_tags, test_tags = train_test_split(sentences,
                                                                tags,
                                                                test_size=0.20,
                                                                random_state=15)
# split data into train and validation sets
train_texts, val_texts, train_tags, val_tags = train_test_split(temp_texts,
                                                                temp_tags,
                                                                test_size=0.20,
                                                                random_state=15)
After splitting the data, I created encodings for tags and the tokens
unique_tags=dataset.Tag.unique()
# create tags to id
tag2id = {tag: id for id, tag in enumerate(unique_tags)}
# create id to tags
id2tag = {id: tag for tag, id in tag2id.items()}
I then installed the transformer library in colab
# install the transformers library
! pip install transformers
Next I imported the small bert model
# import the transformers module
from transformers import BertTokenizerFast
# import the small bert model
model_name = "google/bert_uncased_L-4_H-512_A-8"
tokenizer = BertTokenizerFast.from_pretrained(model_name)
I then created the encodings for the tokens
# create train set encodings
train_encodings = tokenizer(train_texts,
is_split_into_words=True,
return_offsets_mapping=True,
padding=True,
max_length=128,
truncation=True)
# create validation set encodings
val_encodings = tokenizer(val_texts,
is_split_into_words=True,
return_offsets_mapping=True,
padding=True,
max_length=128,
truncation=True)
# create test set encodings
test_encodings = tokenizer(test_texts,
is_split_into_words=True,
return_offsets_mapping=True,
padding=True,
max_length=128,
truncation=True)
In the tutorial, offset mapping is used to handle the problem that arises with word-piece tokenization, specifically the mismatch between tokens and labels. It is when running the offset-mapping code from the tutorial that I get the error. Below is the offset-mapping function used in the tutorial:
# the offset function
import numpy as np

def encode_tags(tags, encodings):
    labels = [[tag2id[tag] for tag in doc] for doc in tags]
    encoded_labels = []
    for doc_labels, doc_offset in zip(labels, encodings.offset_mapping):
        # create an empty array of -100
        doc_enc_labels = np.ones(len(doc_offset), dtype=int) * -100
        arr_offset = np.array(doc_offset)
        # set labels whose first offset position is 0 and the second is not 0
        doc_enc_labels[(arr_offset[:, 0] == 0) & (arr_offset[:, 1] != 0)] = doc_labels
        encoded_labels.append(doc_enc_labels.tolist())
    return encoded_labels
# return the encoded labels
train_labels = encode_tags(train_tags, train_encodings)
val_labels = encode_tags(val_tags, val_encodings)
test_labels = encode_tags(test_tags, test_encodings)
After running the above code, it gives me the following error, and I can't figure out where the source of the error lies. Any help and pointers would be appreciated.
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-19-afdff0186eb3> in <module>()
17
18 # return the encoded labels
---> 19 train_labels = encode_tags(train_tags, train_encodings)
20 val_labels = encode_tags(val_tags, val_encodings)
21 test_labels = encode_tags(test_tags, test_encodings)
<ipython-input-19-afdff0186eb3> in encode_tags(tags, encodings)
11
12 # set labels whose first offset position is 0 and the second is not 0
---> 13 doc_enc_labels[(arr_offset[:,0] == 0) & (arr_offset[:,1] != 0)] = doc_labels
14 encoded_labels.append(doc_enc_labels.tolist())
15
ValueError: NumPy boolean array indexing assignment cannot assign 38 input values to the 37 output values where the mask is true
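One likely cause, offered as a hypothesis since no answer is recorded above: with truncation=True and max_length=128, long sentences lose word pieces, so the boolean mask selects fewer word-start positions (here 37) than there are word-level labels (38). A small diagnostic sketch (check_alignment is a hypothetical helper) to confirm which documents are affected before assigning:

def check_alignment(tags, encodings):
    # flag documents whose label count does not match the number of
    # word-start tokens, i.e. offsets of the form (0, n) with n != 0
    labels = [[tag2id[tag] for tag in doc] for doc in tags]
    for i, (doc_labels, doc_offset) in enumerate(zip(labels, encodings.offset_mapping)):
        arr_offset = np.array(doc_offset)
        n_starts = ((arr_offset[:, 0] == 0) & (arr_offset[:, 1] != 0)).sum()
        if n_starts != len(doc_labels):
            print(f"doc {i}: {len(doc_labels)} labels vs {n_starts} word-start tokens (likely truncated)")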

Torchtext 0.7 shows Field is being deprecated. What is the alternative?

Looks like the previous paradigm of declaring Fields, Examples and using BucketIterator is deprecated and will move to legacy in 0.8. However, I don't seem to be able to find an example of the new paradigm for custom datasets (as in, not the ones included in torchtext.datasets) that doesn't use Field. Can anyone point me at an up-to-date example?
Reference for deprecation:
https://github.com/pytorch/text/releases
It took me a little while to find the solution myself. The new paradigm is like so for prebuilt datasets:
from torchtext.experimental.datasets import AG_NEWS
train, test = AG_NEWS(ngrams=3)
or like so for custom built datasets:
from torch.utils.data import DataLoader
def collate_fn(batch):
    texts, labels = [], []
    for label, txt in batch:
        texts.append(txt)
        labels.append(label)
    return texts, labels

dataloader = DataLoader(train, batch_size=8, collate_fn=collate_fn)
for idx, (texts, labels) in enumerate(dataloader):
    print(idx, texts, labels)
I've copied the examples from the Source
Browsing through torchtext's GitHub repo I stumbled over the README in the legacy directory, which is not documented in the official docs. The README links a GitHub issue that explains the rationale behind the change as well as a migration guide.
If you just want to keep your existing code running with torchtext 0.9.0, where the deprecated classes have been moved to the legacy module, you have to adjust your imports:
# from torchtext.data import Field, TabularDataset
from torchtext.legacy.data import Field, TabularDataset
Alternatively, you can import the whole torchtext.legacy module as torchtext as suggested by the README:
import torchtext.legacy as torchtext
There is a post regarding this. Instead of the deprecated Field and BucketIterator classes, it uses TextClassificationDataset along with a collator and other preprocessing. It reads a txt file and builds a dataset, followed by a model. The post links to a complete working notebook: https://mmg10.github.io/pytorch/2021/02/16/text_torch.html. Note that you need the 'dev' (nightly) build of PyTorch for it to work.
From the link above:
After tokenization and building the vocabulary, you can build the dataset as follows:
def data_to_dataset(data, tokenizer, vocab):
    data = [(text, label) for (text, label) in data]
    text_transform = sequential_transforms(tokenizer.tokenize,
                                           vocab_func(vocab),
                                           totensor(dtype=torch.long))
    label_transform = sequential_transforms(lambda x: 1 if x == '1' else (0 if x == '0' else x),
                                            totensor(dtype=torch.long))
    transforms = (text_transform, label_transform)
    dataset = TextClassificationDataset(data, vocab, transforms)
    return dataset
The collator is as follows:
def __init__(self, pad_idx):
self.pad_idx = pad_idx
def collate(self, batch):
text, labels = zip(*batch)
labels = torch.LongTensor(labels)
text = nn.utils.rnn.pad_sequence(text, padding_value=self.pad_idx, batch_first=True)
return text, labels
Then, you can build the dataloader with the typical torch.utils.data.DataLoader using the collate_fn argument.
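A hypothetical usage sketch tying the pieces together (the Collator name from above, the pad-token lookup, and the train_dataset name are my assumptions, not from the post):

from torch.utils.data import DataLoader

collator = Collator(pad_idx=vocab['<pad>'])  # pad-token lookup is an assumption
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True,
                          collate_fn=collator.collate)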
Well, it seems the pipeline could look like this:
import torchtext as TT
import torch
from collections import Counter
from torch.utils.data import DataLoader
from torchtext.vocab import Vocab

# read the data
with open('text_data.txt', 'r') as f:
    data = f.readlines()
with open('labels.txt', 'r') as f:
    labels = f.readlines()

tokenizer = TT.data.utils.get_tokenizer('spacy', 'en')  # can remove 'spacy' and use a simple built-in tokenizer
train_iter = zip(labels, data)
counter = Counter()
for (label, line) in train_iter:
    counter.update(tokenizer(line))
vocab = TT.vocab.Vocab(counter, min_freq=1)

text_pipeline = lambda x: [vocab[token] for token in tokenizer(x)]
# this is data-specific - adapt for your data
label_pipeline = lambda x: 1 if x == 'positive\n' else 0

class TextData(torch.utils.data.Dataset):
    '''
    very basic dataset for processing text data
    '''
    def __init__(self, labels, text):
        super(TextData, self).__init__()
        self.labels = labels
        self.text = text

    def __getitem__(self, index):
        return self.labels[index], self.text[index]

    def __len__(self):
        return len(self.labels)

def tokenize_batch(batch, max_len=200):
    '''
    collate function to use in DataLoader
    takes a batch of the text dataset and produces a tensor batch, converting text and
    labels through the global text_pipeline and label_pipeline functions
    max_len is a fixed length; if a text is shorter than max_len it is padded with ones
    (the pad number), and if it is longer than max_len it is truncated from the start
    '''
    labels_list, text_list = [], []
    for _label, _text in batch:
        labels_list.append(label_pipeline(_label))
        text_holder = torch.ones(max_len, dtype=torch.int32)  # fixed-size tensor of max_len
        processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int32)
        pos = min(max_len, len(processed_text))
        text_holder[-pos:] = processed_text[-pos:]
        text_list.append(text_holder.unsqueeze(dim=0))
    return torch.FloatTensor(labels_list), torch.cat(text_list, dim=0)

train_dataset = TextData(labels, data)
train_loader = DataLoader(train_dataset, batch_size=2, shuffle=False, collate_fn=tokenize_batch)
lbl, txt = next(iter(train_loader))

How to save list of extracted features in hdf5 file python

I am extracting some features from an audio file, saving them in a list, and then saving the list in an HDF5 file, but this causes an error. Previously I was saving the features directly to an HDF5 file, but that just overwrote all the values and saved only the last one.
ampList = []
mffcslist = []
centroidlist = []
i = 0
ampList.append(Xdb)  # saving extracted feature in a list
mffcslist.append(mfccs)
centroidlist.append(spectral_centroids)
with h5py.File('C:/Users/Aweem Ashar/Desktop/feature.h5', 'a') as f:
    f.close()
    for i in range(len(audio_path)):
        # print(ampList[i])
        f.create_dataset("amplitude", data=ampList[i])
        f.create_dataset("MffC", data=mffcslist[i])
        f.create_dataset("spectral", data=centroidlist[i])
        # plt.show()  # To view Wave graph
I didn't look at your code that closely when I wrote my comment. I just realized you are loading your list data one element at a time. There are much better/faster ways to do it with NumPy arrays. I don't know what kind of data you're working with, so I created a very simple example with a few floats in ampList. I use np.asarray() to convert the list to a NumPy array and load it into the dataset in one shot. Much easier and more compact. This method (with np.asarray()) will work for any list whose elements share a common type (all floats or all ints).
My simple example:
import h5py
import numpy as np

ampList = [20., 11., 33., 40., 100.]

with h5py.File('SO_58092765.h5', 'w') as h5f:
    h5f.create_dataset("amplitude", data=np.asarray(ampList))
A Better Approach:
The example above addresses your basic question (how to copy the list data into an HDF5 dataset). However, I think there is a better approach for your scenario. I assume you have amplitude, MffC, and spectral data for each and every audio file, AND it would be convenient to have that data associated with the audio file name. If so, that's where HDF5 and mixed-format datatypes are so powerful.
I created a second example (below) to show how you can save mixed data in a single dataset. I assumed the following datatypes (to make the example interesting):
Audio file name: String
amplitude: Float
MffC: Integer
Spectral (centroid): Float array of shape (3,)
This example creates 2 HDF5 files:
SO_58092765_3ds.h5: saves each List as a separate dataset.
SO_58092765_1ds.h5: saves all List data in a single dataset, with each List written to a separate Field/Column.
The second method uses a Numpy datatype (dtype) to define the name and datatype of each column of data in the HDF5 dataset. The dtype is then used to create an empty dataset. Each List is written to the dataset by referencing the field name.
Second example:
import h5py
import numpy as np

fileList = ['audio1.mp3', 'audio2.mp3', 'audio11.mp3', 'audio21.mp3', 'audio22.mp3']
ampList = [20., 11., 33., 40., 100.]
mffcslist = [12, 8, 9, 14, 33]
centroidlist = [(0., 0., 0.), (1., 0., 0.),
                (0., 1., 0.), (0., 1., 0.),
                (1., 1., 1.)]

# create SO_58092765_3ds.h5:
with h5py.File('SO_58092765_3ds.h5', 'w') as h5f:
    h5f.create_dataset("amplitude", data=np.asarray(ampList))
    h5f.create_dataset("MffC", data=np.asarray(mffcslist))
    h5f.create_dataset("spectral", data=np.asarray(centroidlist))

# create SO_58092765_1ds.h5 with ds_dtype:
ds_dtype = np.dtype([("audiofile", 'S20'), ("amplitude", float),
                     ("MffC", int), ("spectral", float, (3,))])
with h5py.File('SO_58092765_1ds.h5', 'w') as h5f:
    ds = h5f.create_dataset("test_data", shape=(len(ampList),), dtype=ds_dtype)
    ds['audiofile'] = np.asarray(fileList)
    ds['amplitude'] = np.asarray(ampList)
    ds['MffC'] = np.asarray(mffcslist)
    ds['spectral'] = np.asarray(centroidlist)
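For what it's worth, compound datasets can be read back field by field; a quick verification sketch using the same file as above:

import h5py

with h5py.File('SO_58092765_1ds.h5', 'r') as h5f:
    ds = h5f['test_data']
    print(ds['audiofile'][0])  # b'audio1.mp3'
    print(ds['spectral'][1])   # [1. 0. 0.]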
import matplotlib.pyplot as plt
from glob import glob
import librosa as lb
import sklearn
import librosa.display
import librosa
import h5py
import numpy as np

dir = 'C:\\Users\\Aweem Ashar\\Desktop\\recordingd'
audio_path = glob(dir + '/*.wav')
ampList = []
mffcslist = []
centroidlist = []
for file in range(0, len(audio_path)):
    x, sr = lb.load(audio_path[file])
    print(type(x), type(sr))
    librosa.display.waveplot(x, sr=sr)
    X = librosa.stft(x)
    Xdb = librosa.amplitude_to_db(abs(X))
    librosa.display.specshow(Xdb, sr=sr, x_axis='time', y_axis='hz')
    spectral_centroids = librosa.feature.spectral_centroid(x, sr=sr)[0]
    spectral_centroids.shape
    frames = range(len(spectral_centroids))
    t = librosa.frames_to_time(frames)

    def normalize(x, axis=0):
        return sklearn.preprocessing.minmax_scale(x, axis=axis)

    librosa.display.waveplot(x, sr=sr, alpha=0.4)
    plt.plot(t, normalize(spectral_centroids), color='r')
    mfccs = librosa.feature.mfcc(x, sr=sr)
    print(mfccs.shape)
    librosa.display.specshow(mfccs, sr=sr, x_axis='time')
    ampList.append(Xdb)
    mffcslist.append(mfccs)
    centroidlist.append(spectral_centroids)
with h5py.File('C:/Users/Aweem Ashar/Desktop/feature.h5', 'w') as f:
    f.create_dataset("amplitude", data=np.asarray(ampList))
    f.create_dataset("MFCC", data=np.asarray(mffcslist))
    f.create_dataset("SpectralCentroid", data=np.asarray(centroidlist))
Aweem, I'm not familiar with librosa or sklearn, so I can't debug all your code. When working with something new, use a minimal, complete, verifiable example (MCVE) to confirm behavior with simple data sets. They are much easier to diagnose.
To do that, I simplified the for loop in your second post, reorganizing and removing what I thought was unnecessary. Also, you don't need to loop over all the images; change the glob() call to get 1 (or a few) files. The shape of Xdb (saved to ampList) should show why NumPy's asarray() tries to broadcast the shapes (and why you get an error). If not, post the output for review.
Finally, you should comment out the create_dataset("amplitude") call to verify that the other 2 create_dataset() calls work. Good luck.
dir = 'C:\\Users\\Aweem Ashar\\Desktop\\recordingd'
# change this to get 1 wav file:
audio_path = glob(dir + '/*.wav')
ampList = []
mffcslist = []
centroidlist = []
for file in range(0, len(audio_path)):
    x, sr = lb.load(audio_path[file])
    print(x, sr)
    X = librosa.stft(x)
    Xdb = librosa.amplitude_to_db(abs(X))
    print(Xdb.shape)
    ampList.append(Xdb)
    spectral_centroids = librosa.feature.spectral_centroid(x, sr=sr)[0]
    print(spectral_centroids.shape)
    centroidlist.append(spectral_centroids)
    mfccs = librosa.feature.mfcc(x, sr=sr)
    print(mfccs.shape)
    mffcslist.append(mfccs)
with h5py.File('C:/Users/Aweem Ashar/Desktop/feature.h5', 'w') as f:
    f.create_dataset("amplitude", data=np.asarray(ampList))
    f.create_dataset("MFCC", data=np.asarray(mffcslist))
    f.create_dataset("SpectralCentroid", data=np.asarray(centroidlist))

How to map audio to target text transcription

I am new to deep learning. I am making a basic end-to-end voice recognizer using the TensorFlow API, an LSTM model, and the CTC loss function. I have extracted my audio features as MFCCs. I don't really know how to map my audios to transcriptions; I know CTC is used for this purpose, and I know how CTC works, but I don't know the code to implement it.
Here is my code to extract features
import os
import numpy as np
import glob
import scipy.io.wavfile as wav
from python_speech_features import mfcc, logfbank

# Read the input audio files
for f in glob.glob('Downloads/DataVoices/Training/**/*.wav', recursive=True):
    (rate, sig) = wav.read(f)
    sig = sig.astype(np.float64)
    # Take the first 10,000 samples for analysis
    # sig = sig[:10000]
    mfcc_feat = mfcc(sig, rate, winlen=0.025, winstep=0.01,
                     numcep=13, nfilt=26, nfft=512, lowfreq=0, highfreq=None,
                     preemph=0.97, ceplifter=22, appendEnergy=True)
    fbank_feat = logfbank(sig, rate)
    acoustic_features = np.concatenate((mfcc_feat, fbank_feat), axis=1)  # time_stamp x n_features
    print(acoustic_features)
I have also made a training list.txt file where I have provided transcriptions with audio paths, like:
this is example/001/001.wav
this is example/001/001(1).wav
where 001 is a folder, and 001.wav and 001(1).wav are two wave files of one utterance.
I am posting this as a contrived example assuming that this would give an idea of how a CSV file and filenames inside the CSV can be read. You could modify this to suit your needs.
Let's say I have this CSV file. The first column is your transcript. The file path is your audio file. In my case it is just a text file with random text.
Script1,D:/PycharmProjects/TensorFlow/script1.txt
Script2,D:/PycharmProjects/TensorFlow/script2.txt
This is the code I use to test it. Please remember this is an example.
import tensorflow as tf

batch_size = 1
record_defaults = [['Test'], ['D:/PycharmProjects/TensorFlow/script1.txt']]

def readbatch(data_queue):
    reader = tf.TextLineReader()
    _, rows = reader.read_up_to(data_queue, batch_size)
    transcript, wav_filename = tf.decode_csv(rows, record_defaults, field_delim=",")
    audioreader = tf.WholeFileReader()
    print(wav_filename)
    _, audio = audioreader.read(tf.train.string_input_producer(wav_filename))
    return [audio, transcript]

data_queue = tf.train.string_input_producer(['D:\\PycharmProjects\\TensorFlow\\script.csv'], shuffle=False)
batch_data = readbatch(data_queue)
batch_values = tf.train.batch(batch_data, shapes=[tf.TensorShape(()), tf.TensorShape(batch_size,)], batch_size=batch_size, enqueue_many=False)
init = tf.initialize_all_variables()

with tf.Session() as sess:
    sess.run(init)
    sess.run(tf.initialize_local_variables())
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)
    try:
        step = 0
        while not coord.should_stop():
            step += 1
            feat = sess.run([batch_values])
            audio = feat[0][0]
            print(audio)
            script = feat[0][1]
            print(script)
    except tf.errors.OutOfRangeError:
        print(' training for 1 epochs, %d steps', step)
    finally:
        coord.request_stop()
        coord.join(threads)
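The CSV example covers the file plumbing only; CTC additionally needs each transcript as a sequence of integer label ids. A hypothetical character-level mapping (the alphabet below is my assumption, not from the answer) could look like:

alphabet = "abcdefghijklmnopqrstuvwxyz '"  # hypothetical label set; index = label id
char_to_id = {c: i for i, c in enumerate(alphabet)}

def transcript_to_labels(text):
    # drop characters outside the label set, lower-case the rest
    return [char_to_id[c] for c in text.lower() if c in char_to_id]

print(transcript_to_labels("this is example"))  # e.g. [19, 7, 8, 18, 26, ...]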

Data loading with variable batch size?

I am currently working on patch-based super-resolution. Most of the papers divide an image into smaller patches and then use the patches as input to the models. I was able to create patches using a custom dataloader. The code is given below:
import torch.utils.data as data
from torchvision.transforms import CenterCrop, ToTensor, Compose, ToPILImage, Resize, RandomHorizontalFlip, RandomVerticalFlip
from os import listdir
from os.path import join
from PIL import Image
import random
import os
import numpy as np
import torch

def is_image_file(filename):
    return any(filename.endswith(extension) for extension in [".png", ".jpg", ".jpeg", ".bmp"])

class TrainDatasetFromFolder(data.Dataset):
    def __init__(self, dataset_dir, patch_size, is_gray, stride):
        super(TrainDatasetFromFolder, self).__init__()
        self.imageHrfilenames = []
        self.imageHrfilenames.extend(join(dataset_dir, x)
                                     for x in sorted(listdir(dataset_dir)) if is_image_file(x))
        self.is_gray = is_gray
        self.patchSize = patch_size
        self.stride = stride

    def _load_file(self, index):
        filename = self.imageHrfilenames[index]
        hr = Image.open(self.imageHrfilenames[index])
        downsizes = (1, 0.7, 0.45)
        downsize = 2
        w_ = int(hr.width * downsizes[downsize])
        h_ = int(hr.height * downsizes[downsize])
        aug = Compose([Resize([h_, w_], interpolation=Image.BICUBIC),
                       RandomHorizontalFlip(),
                       RandomVerticalFlip()])
        hr = aug(hr)
        rv = random.randint(0, 4)
        hr = hr.rotate(90 * rv, expand=1)
        filename = os.path.splitext(os.path.split(filename)[-1])[0]
        return hr, filename

    def _patching(self, img):
        img = ToTensor()(img)
        LR_ = Compose([ToPILImage(), Resize(self.patchSize // 2, interpolation=Image.BICUBIC), ToTensor()])
        HR_p, LR_p = [], []
        for i in range(0, img.shape[1] - self.patchSize, self.stride):
            for j in range(0, img.shape[2] - self.patchSize, self.stride):
                temp = img[:, i:i + self.patchSize, j:j + self.patchSize]
                HR_p += [temp]
                LR_p += [LR_(temp)]
        return torch.stack(LR_p), torch.stack(HR_p)

    def __getitem__(self, index):
        HR_, filename = self._load_file(index)
        LR_p, HR_p = self._patching(HR_)
        return LR_p, HR_p

    def __len__(self):
        return len(self.imageHrfilenames)
Suppose the batch size is 1: it takes an image and gives an output of size [x, 3, patchsize, patchsize]. When the batch size is 2, I will have two different outputs of size [x, 3, patchsize, patchsize] (for example, image 1 may give [50, 3, patchsize, patchsize] and image 2 may give [75, 3, patchsize, patchsize]). To handle this, a custom collate function was required that concatenates these two outputs along dimension 0. The collate function is given below:
def my_collate(batch):
    data = torch.cat([item[0] for item in batch], dim=0)
    target = torch.cat([item[1] for item in batch], dim=0)
    return [data, target]
This collate function concatenates along x (from the above example, I finally get [125, 3, patchsize, patchsize]). For training purposes, I need to train the model using a minibatch size of, say, 25. Is there any method or function I can use to directly get an output of size [25, 3, patchsize, patchsize] from the dataloader, using the necessary number of images as input?
The following code snippet works for your purpose.
First, we define a ToyDataset which takes in a list of tensors (tensors) of variable length in dimension 0. This is similar to the samples returned by your dataset.
import torch
from torch.utils.data import Dataset
from torch.utils.data.sampler import RandomSampler

class ToyDataset(Dataset):
    def __init__(self, tensors):
        self.tensors = tensors

    def __getitem__(self, index):
        return self.tensors[index]

    def __len__(self):
        return len(self.tensors)
Secondly, we define a custom data loader. The usual PyTorch division of labor between datasets and data loaders is roughly the following: there is an indexed dataset, to which you can pass an index and which returns the associated sample; there is a sampler which yields indices, with different drawing strategies giving rise to different samplers; a batch_sampler uses the sampler to draw multiple indices at once (as many as specified by batch_size); and there is a dataloader which combines sampler and dataset to let you iterate over the dataset. Importantly, the dataloader also owns a function (collate_fn) which specifies how the multiple samples retrieved using the batch_sampler's indices should be combined.
For your use case, this usual division of labor does not work well, because instead of drawing a fixed number of indices, we need to keep drawing indices until the objects associated with them reach the cumulative size we desire. This means we need to inspect the objects immediately and use that knowledge to decide whether to return a batch or keep drawing indices. This is what the custom data loader below does:
class CustomLoader(object):
    def __init__(self, dataset, my_bsz, drop_last=True):
        self.ds = dataset
        self.my_bsz = my_bsz
        self.drop_last = drop_last
        self.sampler = RandomSampler(dataset)

    def __iter__(self):
        batch = torch.Tensor()
        for idx in self.sampler:
            batch = torch.cat([batch, self.ds[idx]])
            while batch.size(0) >= self.my_bsz:
                if batch.size(0) == self.my_bsz:
                    yield batch
                    batch = torch.Tensor()
                else:
                    return_batch, batch = batch.split([self.my_bsz, batch.size(0) - self.my_bsz])
                    yield return_batch
        if batch.size(0) > 0 and not self.drop_last:
            yield batch
Here we iterate over the dataset; after drawing an index and loading the associated object, we concatenate it to the tensors we drew before (batch). We keep doing this until we reach the desired size, so that we can cut out and yield a batch. We retain the rows in batch that we did not yield. Because a single instance may exceed the desired batch_size, we use a while loop.
You could modify this minimal CustomLoader to add more features in the style of PyTorch's dataloader. There is also no need to use a RandomSampler to draw indices; others would work equally well. It would also be possible to avoid repeated concats if your data is large, for example by keeping a list and tracking the cumulative length of its tensors.
Here is an example, that demonstrates it works:
patch_size = 5
channels = 3
dim0sizes = torch.LongTensor(100).random_(1, 100)
data = torch.randn(size=(dim0sizes.sum(), channels, patch_size, patch_size))
tensors = torch.split(data, list(dim0sizes))
ds = ToyDataset(tensors)
dl = CustomLoader(ds, my_bsz=250, drop_last=False)
for i in dl:
    print(i.size(0))
(Related, but not exactly on topic)
For batch size adaptation you can use the code exemplified in this repo. It is implemented for a different purpose (maximizing GPU memory usage), but it is not too hard to translate to your problem.
The code does batch adaptation and batch spoofing.
To improve on the previous answer, I found a repo that uses a DataManager to achieve different patch sizes and batch sizes. It basically instantiates different dataloaders with different settings, and a set_epoch function is used to select the appropriate dataloader for a given epoch.
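A hedged sketch of that idea (the class layout and names are my own, not necessarily the repo's):

class DataManager:
    def __init__(self, loaders_by_start_epoch):
        # e.g. {0: loader_small_patches, 10: loader_large_patches}
        self.schedule = sorted(loaders_by_start_epoch.items())
        self.current = self.schedule[0][1]

    def set_epoch(self, epoch):
        # select the loader whose start epoch is the largest one <= epoch
        for start, loader in self.schedule:
            if epoch >= start:
                self.current = loader

    def __iter__(self):
        return iter(self.current)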
