When training a model using a custom dataset in PyTorch 1.4, the following error is thrown after a seemingly random number of epochs.
RuntimeError: Couldn't open shared file mapping: <torch_15324_2327643205>, error code: <1455>
The dataset is wrapped in a torch.utils.data.DataLoader and uses 4 workers, equal to the number of physical cores.
import pickle

import torch.utils.data as data

class TSNDataSet(data.Dataset):
    def __init__(self, pickle_file_paths, transforms):
        self.pickle_file_paths = pickle_file_paths  # list with file paths to pickle files
        self.dataset_size = len(pickle_file_paths)

    def __getitem__(self, index):
        with open(self.pickle_file_paths[index], 'rb') as f:
            mffs = pickle.load(f)
        return mffs, index

    def __len__(self):
        return self.dataset_size
It would be helpful to know what the error means and what the possible solutions are.
Related
If we use a combination of the Dataset and DataLoader classes (as shown below), I have to explicitly load the data onto the GPU using .to() or .cuda(). Is there a way to instruct the DataLoader to do it automatically/implicitly?
Code to understand/reproduce the scenario:
from torch.utils.data import Dataset, DataLoader
import numpy as np

class DemoData(Dataset):
    def __init__(self, limit):
        super(DemoData, self).__init__()
        self.data = np.arange(limit)

    def __len__(self):
        return self.data.shape[0]

    def __getitem__(self, idx):
        return (self.data[idx], self.data[idx] * 100)

demo = DemoData(100)
loader = DataLoader(demo, batch_size=50, shuffle=True)

for i, (i1, i2) in enumerate(loader):
    print('Batch Index: {}'.format(i))
    print('Shape of data item 1: {}; shape of data item 2: {}'.format(i1.shape, i2.shape))
    # i1, i2 = i1.to('cuda:0'), i2.to('cuda:0')
    print('Device of data item 1: {}; device of data item 2: {}\n'.format(i1.device, i2.device))
This will output the following; note that without an explicit device-transfer instruction, the data is loaded onto the CPU:
Batch Index: 0
Shape of data item 1: torch.Size([50]); shape of data item 2: torch.Size([50])
Device of data item 1: cpu; device of data item 2: cpu
Batch Index: 1
Shape of data item 1: torch.Size([50]); shape of data item 2: torch.Size([50])
Device of data item 1: cpu; device of data item 2: cpu
A possible solution is in this PyTorch GitHub issue (still open at the time this question was posted), but I am unable to make it work when the DataLoader has to return multiple data items!
You can modify the collate_fn to handle several items at once:
import torch
from torch.utils.data import DataLoader
from torch.utils.data.dataloader import default_collate

device = torch.device('cuda:0')  # or whatever device/cpu you like

# the new collate function is quite generic
loader = DataLoader(demo, batch_size=50, shuffle=True,
                    collate_fn=lambda x: tuple(x_.to(device) for x_ in default_collate(x)))
Note that if you want to have multiple workers for the DataLoader, you'll need to add torch.multiprocessing.set_start_method('spawn') after your if __name__ == '__main__' (see this issue), as sketched below.
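A minimal sketch of that guard, reusing demo from the question above; the num_workers value and the collate_to_device name are only illustrations, not part of the original answer:
import torch
from torch.utils.data import DataLoader
from torch.utils.data.dataloader import default_collate

device = torch.device('cuda:0')

def collate_to_device(batch):
    # a named function (rather than a lambda) so the spawned workers can pickle it
    return tuple(x_.to(device) for x_ in default_collate(batch))

if __name__ == '__main__':
    # 'spawn' avoids forking a process that already holds a CUDA context
    torch.multiprocessing.set_start_method('spawn')
    loader = DataLoader(demo, batch_size=50, shuffle=True, num_workers=2,
                        collate_fn=collate_to_device)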
Having said that, it seems like using pin_memory=True in your DataLoader would be much more efficient. Have you tried this option?
See memory pinning for more information.
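For reference, a rough sketch of the pin_memory alternative, again reusing demo from the question (non_blocking=True is an optional extra, not something the question used):
from torch.utils.data import DataLoader

loader = DataLoader(demo, batch_size=50, shuffle=True, pin_memory=True)
for i1, i2 in loader:
    # batches arrive in page-locked host memory, which makes the copy to the GPU cheaper
    i1 = i1.to('cuda:0', non_blocking=True)
    i2 = i2.to('cuda:0', non_blocking=True)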
Update (Feb 8th, 2021)
This post made me look at my "data-to-model" time spent during training.
I compared three alternatives:
1. The DataLoader works on the CPU, and only after the batch is retrieved is the data moved to the GPU.
2. Same as (1) but with pin_memory=True in the DataLoader.
3. The proposed method of using collate_fn to move the data to the GPU.
From my limited experimentation it seems like the second option performs best (but not by a big margin).
The third option required fussing about the start_method of the data loader processes, and it seems to incur an overhead at the beginning of each epoch.
I need to use a BatchSampler within a PyTorch DataLoader instead of calling __getitem__ of the dataset multiple times (it is a remote dataset, and each query is expensive). I cannot understand how to use the BatchSampler with any given dataset.
e.g.:
class MyDataset(Dataset):
    def __init__(self, remote_ddf):
        self.ddf = remote_ddf

    def __len__(self):
        return len(self.ddf)

    def __getitem__(self, idx):
        return self.ddf[idx]  # this is as expensive as a batch call

    def get_batch(self, batch_idx):
        return self.ddf[batch_idx]

my_loader = DataLoader(MyDataset(remote_ddf),
                       batch_sampler=BatchSampler(Sampler(), batch_size=3))
What I do not understand, and have not found any example of online or in the torch docs, is how to use my get_batch function instead of the __getitem__ function.
Edit:
Following Szymon Maszke's answer, this is what I tried, and yet __getitem__ gets one index per call instead of a list of size batch_size:
class Dataset(Dataset):
    def __init__(self):
        ...

    def __len__(self):
        ...

    def __getitem__(self, batch_idx):  # here I get only one index
        return self.wiki_df.loc[batch_idx]

loader = DataLoader(
    dataset=dataset,
    batch_sampler=BatchSampler(
        SequentialSampler(dataset), batch_size=self.hparams.batch_size, drop_last=False),
    num_workers=self.hparams.num_data_workers,
)
You can't use get_batch instead of __getitem__, and I don't see a point in doing it that way.
torch.utils.data.BatchSampler takes indices from your Sampler() instance (in this case 3 of them) and returns them as a list, so they can be used in your MyDataset __getitem__ method (check the source code; most of the samplers and data-related utilities are easy to follow in case you need it).
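For example, a quick way to see the lists it produces:
from torch.utils.data import BatchSampler, SequentialSampler

# each element is a list of indices, e.g. for 10 samples and batch_size=3:
print(list(BatchSampler(SequentialSampler(range(10)), batch_size=3, drop_last=False)))
# [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]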
I assume your self.ddf supports list slicing (e.g. self.ddf[[25, 44, 115]] returns values correctly and uses only one expensive call). In that case, simply switch get_batch into __getitem__ and you are good to go.
class MyDataset(Dataset):
    def __init__(self, remote_ddf):
        self.ddf = remote_ddf

    def __len__(self):
        return len(self.ddf)

    def __getitem__(self, batch_idx):
        return self.ddf[batch_idx]  # batch_idx is a list
EDIT: You have to pass the BatchSampler as sampler (not as batch_sampler), otherwise the batch will be divided into single indices. This should be fine:
loader = DataLoader(
    dataset=dataset,
    # This line below!
    sampler=BatchSampler(
        SequentialSampler(dataset), batch_size=self.hparams.batch_size, drop_last=False
    ),
    num_workers=self.hparams.num_data_workers,
)
How to use tfrecord with PyTorch?
I have downloaded the "Youtube8M" dataset with video-level features, but it is stored in tfrecord format.
I tried to read some samples from these files to convert them to numpy and then load them in PyTorch, but it failed.
import tensorflow as tf

# YT8MAggregatedFeatureReader comes from the YouTube-8M starter code
reader = YT8MAggregatedFeatureReader()

files = tf.gfile.Glob("/Data/youtube8m/train*.tfrecord")
filename_queue = tf.train.string_input_producer(
    files, num_epochs=5, shuffle=True)
training_data = [
    reader.prepare_reader(filename_queue) for _ in range(1)
]

unused_video_id, model_input_raw, labels_batch, num_frames = tf.train.shuffle_batch_join(
    training_data,
    batch_size=1024,
    capacity=1024 * 5,
    min_after_dequeue=1024,
    allow_smaller_final_batch=True,
    enqueue_many=True)

with tf.Session() as sess:
    label_numpy = labels_batch.eval()
    print(type(label_numpy))
But this step produces no result; it just hangs for a long while without any response.
One workaround is to use TensorFlow 1.1x eager mode or TensorFlow 2+ to loop through the dataset (so you can use variable-length features and bucketed windows), and then just use
torch.as_tensor(val.numpy()).to(device) to get the values into torch.
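A minimal sketch of that workaround, assuming TensorFlow 2.x; parse_example and the "mean_rgb" feature name are placeholders for your own record schema, not part of the original answer:
import tensorflow as tf
import torch

device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

ds = tf.data.TFRecordDataset(tf.io.gfile.glob("/Data/youtube8m/train*.tfrecord"))
for raw_record in ds:                     # eager mode: iterate over records directly
    example = parse_example(raw_record)   # your own tf.io.parse_single_example wrapper
    val = example["mean_rgb"]             # placeholder feature name
    features = torch.as_tensor(val.numpy()).to(device)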
You can use the DALI library to load the tfrecords directly in PyTorch code.
You can find out how to do it in their documentation.
Maybe this can help you: TFRecord reader for PyTorch
I cooked up this:
import tensorflow as tf
import torch

class LiTS(torch.utils.data.Dataset):
    def __init__(self, filenames):
        self.filenames = filenames

    def __len__(self):
        return len(self.filenames)

    def __getitem__(self, idx):
        volume, segmentation = None, None
        if idx >= len(self):
            raise IndexError()
        # read_tfrecord is a user-defined parsing function for the record schema
        ds = tf.data.TFRecordDataset(self.filenames[idx:idx + 1])
        for x, y in ds.map(read_tfrecord):
            volume = torch.from_numpy(x.numpy())
            segmentation = torch.from_numpy(y.numpy())
        return volume, segmentation
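A possible usage sketch for this class (the file pattern is hypothetical, and num_workers is kept at 0 here to avoid mixing TensorFlow state with forked worker processes):
import glob
from torch.utils.data import DataLoader

filenames = sorted(glob.glob("/path/to/lits/*.tfrecord"))  # hypothetical location
loader = DataLoader(LiTS(filenames), batch_size=1, num_workers=0)
for volume, segmentation in loader:
    print(volume.shape, segmentation.shape)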
I'm trying to load multi-modal data (e.g. text and image) in PyTorch for image classification. I do not know how to load them simultaneously, as in the following code skeleton.
def __init__(self, img_path, txt_path, transform=None, loader=default_loader):
    ...

def __len__(self):
    return len(self.img_name)

def __getitem__(self, item):
    ...
Can anyone help me?
In __getitem__, you can use a dictionary or a tuple to represent one sample of your data. Later, during training, when you create a DataLoader from the dataset, PyTorch will automatically create batches of dictionaries or tuples.
If you want to assemble samples in a more customized way, check out collate_fn in PyTorch.
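A rough sketch of the dictionary approach, assuming the texts, images, and labels lists have already been built (the class and variable names here are placeholders):
from torch.utils.data import Dataset, DataLoader

class MultiModalDataset(Dataset):
    def __init__(self, texts, images, labels):
        self.texts, self.images, self.labels = texts, images, labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, item):
        # one sample is a dictionary; the DataLoader batches each key separately
        return {'text': self.texts[item],
                'image': self.images[item],
                'label': self.labels[item]}

loader = DataLoader(MultiModalDataset(texts, images, labels), batch_size=8)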
The method __getitem__(self, item) would help you do this.
For example:
def __getitem__(self, item):  # item can be thought of as an index
    text = textList[item]  # textList is a list containing the text you want to feed to the model for element 'item'
    img = imgList[item]    # imgList is a list containing the images you want to feed to the model for element 'item'
    input = [text, img]
    y = labels[item]       # labels is a list containing the label for the text/img input; this is your target
    return input, y
I am using Keras with a tensorflow-gpu backend on an Ubuntu 17.04 VM.
I have created a custom generator to read inputs and classes from pickle files, but it throws the following error:
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
The code for loading the data can be seen here:
def data_gen(self, pklPaths, batch_size=16):
    while True:
        data = []
        labels = []
        for i, pklPath in enumerate(pklPaths):
            # print(pklPath)
            image = pickle.load(open(pklPath, 'rb'))
            for i in range(batch_size):
                # Set a label
                data.append(image[0][0])
                labels.append(image[1][1])
            yield np.array(data), np.array(labels)
Then in the training section I'm using fit_generator:
vm_model.fit_generator(vm.data_gen(pkl_train), validation_data=vm.data_gen(pkl_validate), epochs=15, verbose=2,
                       steps_per_epoch=(5000 / 16), validation_steps=(1000 / 16), callbacks=[tb])
The generator should have better memory management than loading everything at once; however, that doesn't seem to be the case! Any ideas?
OK, so I found the issue, so I'm answering my own question.
Basically, the previous version had an unnecessary inner loop and also kept growing data and labels, essentially loading the entire dataset into memory:
def data_gen(self, pklPaths, batch_size=16):
    while True:
        data = []
        labels = []
        for i, pklPath in enumerate(pklPaths):
            # load pickle
            image = pickle.load(open(pklPath, 'rb'))

            # append
            data.append(image[0][0])
            labels.append(image[1])

            # if batch is complete, yield data and labels and reset
            if i % batch_size == 0 and i != 0:
                yield np.array(data), np.array(labels)
                data.clear()
                labels.clear()