I keep getting this error no matter which model I use. Can anybody give me pointers on what is happening and how to solve it?
Input data for this model is: http://vision.stanford.edu/aditya86/ImageNetDogs/
The issue most likely stems from the part of the error shown here, but which part of my code needs to change to fix it?
(0) INVALID_ARGUMENT: Expected dimension in the range [0, 0), but got 0
[[{{node ArgMax}}]]
[[IteratorGetNext]]
[[IteratorGetNext/_1321]]
(1) INVALID_ARGUMENT: Expected dimension in the range [0, 0), but got 0
[[{{node ArgMax}}]]
[[IteratorGetNext]]
I'm running this code on an HPC system with a GPU.
I think my data preprocessing is fine, since I've QA'd it.
The full error generated by the code is shown first, followed by the code snippets.
Traceback (most recent call last):
File "/mnt/lustre/indy2lfs/work/mdisspt/mdisspt/y2136744/modelzoo/fc_dog_model/tf/run.py", line 292, in <module>
main()
File "/mnt/lustre/indy2lfs/work/mdisspt/mdisspt/y2136744/modelzoo/fc_dog_model/tf/run.py", line 281, in main
run(
File "/mnt/lustre/indy2lfs/work/mdisspt/mdisspt/y2136744/modelzoo/fc_dog_model/tf/run.py", line 226, in run
est.train(
File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 360, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1186, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1217, in _train_model_default
return self._train_with_estimator_spec(estimator_spec, worker_hooks,
File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/tensorflow_estimator/python/estimator/estimator.py", line 1533, in _train_with_estimator_spec
_, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/tensorflow/python/training/monitored_session.py", line 782, in run
return self._sess.run(
File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/tensorflow/python/training/monitored_session.py", line 1311, in run
return self._sess.run(
File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/tensorflow/python/training/monitored_session.py", line 1416, in run
raise six.reraise(*original_exc_info)
File "/mnt/lustre/indy2lfs/sw/miniconda3/4.12.0-py39-gpu/lib/python3.9/site-packages/six.py", line 719, in reraise
raise value
File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/tensorflow/python/training/monitored_session.py", line 1401, in run
return self._sess.run(*args, **kwargs)
File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/tensorflow/python/training/monitored_session.py", line 1469, in run
outputs = _WrappedSession.run(
File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/tensorflow/python/training/monitored_session.py", line 1232, in run
return self._sess.run(*args, **kwargs)
File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 967, in run
result = self._run(None, fetches, feed_dict, options_ptr,
File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1190, in _run
results = self._do_run(handle, final_targets, final_fetches,
File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1370, in _do_run
return self._do_call(_run_fn, feeds, fetches, targets, options,
File "/mnt/lustre/indy2lfs/sw/horovod/0.25.0-gpu/python/3.9.13/lib/python3.9/site-packages/tensorflow/python/client/session.py", line 1396, in _do_call
raise type(e)(node_def, op, message) # pylint: disable=no-value-for-parameter
tensorflow.python.framework.errors_impl.InvalidArgumentError: Graph execution error:
2 root error(s) found.
(0) INVALID_ARGUMENT: Expected dimension in the range [0, 0), but got 0
[[{{node ArgMax}}]]
[[IteratorGetNext]]
[[IteratorGetNext/_1321]]
(1) INVALID_ARGUMENT: Expected dimension in the range [0, 0), but got 0
[[{{node ArgMax}}]]
[[IteratorGetNext]]
0 successful operations.
0 derived errors ignored.
srun: error: r2i4n1: task 0: Exited with exit code 1
srun: launch/slurm: _step_signal: Terminating StepId=4084847.0
Run.py
def run(
args, params, model_fn, train_input_fn=None, eval_input_fn=None,
):
dtype = tf.keras.mixed_precision.Policy(
'mixed_float16', # Important: This is required.
)
tf.keras.mixed_precision.set_global_policy(dtype)
# update and validate runtime params
runconfig_params = params["runconfig"]
update_params_from_args(args, runconfig_params)
validate_params(params)
# save params for reproducibility
save_params(params, model_dir=runconfig_params["model_dir"])
# get runtime configurations
use_cs = is_cs(runconfig_params)
csrunconfig_dict = get_csrunconfig_dict(runconfig_params)
stack_params = get_custom_stack_params(params)
# prep cs1 run environment, run config and estimator
check_env(runconfig_params)
est_config = CSRunConfig(
cs_ip=runconfig_params["cs_ip"],
stack_params=stack_params,
**csrunconfig_dict,
)
model= model_fn()
est = tf.keras.estimator.model_to_estimator(
keras_model=model,
model_dir=runconfig_params["model_dir"],
# config=est_config,
# params=params,
)
# execute based on mode
if runconfig_params["mode"] == "train":
# est.compile(input_fn=train_input_fn)
est.train(
input_fn=train_input_fn,
steps=runconfig_params["steps"],
max_steps=runconfig_params["max_steps"],
# use_cs=use_cs,
)
def main():
"""
Main function
"""
dtype = Policy('mixed_float16')
tf.keras.mixed_precision.set_global_policy(dtype)
tf.keras.backend.set_floatx('float16')
default_model_dir = os.path.join(
os.path.dirname(os.path.abspath(__file__)), "model_dir"
)
parser = create_arg_parser(default_model_dir)
args = parser.parse_args(sys.argv[1:])
params = get_params(args.params)
print(params)
summary_context = (
cs_disable_summaries if args.multireplica else cs_enable_summaries
)
with summary_context():
run(
args=args,
params=params,
model_fn=model_fn,
train_input_fn=train_input_fn,
# eval_input_fn=eval_input_fn,
)
if __name__ == "__main__":
tf.compat.v1.logging.set_verbosity(tf.compat.v1.logging.INFO)
main()
Model.py
def model_fn():
dtype = Policy('mixed_float16')
tf.keras.mixed_precision.set_global_policy(dtype)
# tf.keras.backend.set_floatx('float16')
inputs = tf.keras.Input(shape=(331,331,3))
# Entry block
x = layers.Conv2D(128, 3, strides=2, padding="same")(inputs)
x = layers.BatchNormalization()(x)
x = layers.Activation("relu")(x)
previous_block_activation = x # Set aside residual
for size in [256, 512, 728]:
x = layers.Activation("relu")(x)
x = layers.SeparableConv2D(size, 3, padding="same")(x)
x = layers.BatchNormalization()(x)
x = layers.Activation("relu")(x)
x = layers.SeparableConv2D(size, 3, padding="same")(x)
x = layers.BatchNormalization()(x)
x = layers.MaxPooling2D(3, strides=2, padding="same")(x)
# Project residual
residual = layers.Conv2D(size, 1, strides=2, padding="same")(
previous_block_activation
)
x = layers.add([x, residual]) # Add back residual
previous_block_activation = x # Set aside next residual
x = layers.SeparableConv2D(1024, 3, padding="same")(x)
x = layers.BatchNormalization()(x)
x = layers.Activation("relu")(x)
x = layers.GlobalAveragePooling2D()(x)
activation = "softmax"
units = 1
x = layers.Dropout(0.5)(x)
outputs = layers.Dense(1, activation=activation)(x)
estimator_model = tf.keras.Model(inputs, outputs)
estimator_model.compile(
optimizer=tf.keras.optimizers.Adam(),
loss="categorical_crossentropy",
# metrics=['accuracy']
)
estimator_model.summary()
return estimator_model
data.py
def input_fn(params, mode=tf.estimator.ModeKeys.TRAIN):
"""
:param <dict> params: dict containing input parameters for creating dataset.
Expects the following fields:
- "data_dir" (string): path to the data files to use.
- "batch_size" (int): batch size
- "to_float16" (bool): whether to convert to float16 or not
- "drop_last_batch" (bool): whether to drop the last batch or not
"""
params = {
'train_input': {
'shuffle': True,
'data_dir': 'dog_breed_dataset', # Place to store data
'batch_size': 32,
'num_parallel_calls': 0 # 0 means AUTOTUNE
}
}
training = mode == tf.estimator.ModeKeys.TRAIN
evaluating = mode == tf.estimator.ModeKeys.EVAL
ds = None
input_params = params["train_input"]
data_dir = input_params["data_dir"]
# setting num_parallel_calls to 0 implies AUTOTUNE
num_parallel_calls = input_params.get("num_parallel_calls", 0)
batch_size = (
input_params.get("train_batch_size")
if training
else input_params.get("eval_batch_size")
)
if batch_size is None:
batch_size = input_params["batch_size"]
list_ds = tf.data.Dataset.list_files(str(data_dir+'/*/*'), shuffle=False)
class_names = np.array(sorted([item.split('/')[-1] for item in glob.glob(data_dir + '/*')]))
val_size = int(image_count * 0.2)
def get_label(file_path):
# Convert the path to a list of path components
parts = tf.strings.split(file_path, os.path.sep)
one_hot = parts[-2] == class_names
one_hot=tf.cast(one_hot, tf.int32)
return tf.argmax(one_hot)
# return one_hot
def decode_img(img):
# Convert the compressed string to a 3D uint8 tensor
img = tf.io.decode_jpeg(img, channels=3)
img = tf.cast(img, tf.float16)
img = img / 255  # scale 8-bit pixel values to [0, 1]
# img = tf.keras.applications.mobilenet.preprocess_input(img)
# Resize the image to the desired size
return tf.image.resize(img, [image_param['img_height'], image_param["img_width"]])
def process_path(file_path):
label = get_label(file_path)
# Load the raw data from the file as a string
img = tf.io.read_file(file_path)
img = decode_img(img)
return img, label
if training and input_params["shuffle"]:
list_ds = list_ds.shuffle(image_count, reshuffle_each_iteration=False)
if training:
ds = list_ds.skip(val_size)
ds = ds.repeat()
else:
ds = list_ds.take(val_size)
ds = ds.map(
process_path,
num_parallel_calls=num_parallel_calls
if num_parallel_calls > 0
else tf.data.experimental.AUTOTUNE,
)
return ds
def train_input_fn(params=None):
return input_fn(params, mode=tf.estimator.ModeKeys.TRAIN)
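The label lookup in get_label only works if class_names is non-empty and the parent directory of each file matches exactly one class. Below is a minimal plain-Python sketch of the same lookup, useful for checking a handful of paths before they reach tf.data. The helper name label_index and the example path are hypothetical, and forward-slash paths are assumed.

```python
import posixpath


def label_index(file_path, class_names):
    """Return the index of the class directory that contains file_path."""
    if not class_names:
        raise ValueError("class_names is empty; check that data_dir points at the dataset")
    # Same idea as parts[-2] in get_label: the parent directory is the class name.
    parent = posixpath.basename(posixpath.dirname(file_path))
    one_hot = [int(parent == name) for name in class_names]
    if sum(one_hot) != 1:
        raise ValueError(f"{parent!r} matched {sum(one_hot)} classes")
    return one_hot.index(1)


# Hypothetical class folders and file path, for illustration only.
classes = sorted(["n02085620-Chihuahua", "n02085782-Japanese_spaniel"])
print(label_index("dog_breed_dataset/n02085782-Japanese_spaniel/img_1.jpg", classes))
```

If this raises for real paths, the comparison inside get_label cannot produce a valid one-hot vector either, so it is worth ruling out before looking at the model.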
I have a map-style dataset, which is used for instance segmentation tasks.
The dataset is very imbalanced, in the sense that some images have only 10 objects while others have up to 1200.
How can I limit the number of objects per batch?
A minimal reproducible example is:
import math
import torch
import random
import numpy as np
import pandas as pd
from torch.utils.data import Dataset
from torch.utils.data.sampler import BatchSampler
np.random.seed(0)
random.seed(0)
torch.manual_seed(0)
W = 700
H = 1000
def collate_fn(batch) -> tuple:
return tuple(zip(*batch))
class SyntheticDataset(Dataset):
def __init__(self, image_ids):
self.image_ids = torch.tensor(image_ids, dtype=torch.int64)
self.num_classes = 9
def __len__(self):
return len(self.image_ids)
def __getitem__(self, idx: int):
"""
returns single sample
"""
# print("idx: ", idx)
# deliberately left dangling
# id = self.image_ids[idx].item()
# image_id = self.image_ids[idx]
image_id = torch.as_tensor(idx)
image = torch.randint(0, 255, (H, W))
num_objects = random.randint(10, 1200)
image = torch.randint(0, 255, (3, H, W))
masks = torch.randint(0, 255, (num_objects, H, W))
target = {}
target["image_id"] = image_id
areas = torch.randint(100, 20000, (1, num_objects), dtype=torch.int64)
boxes = torch.randint(100, H * W, (num_objects, 4), dtype=torch.int64)
labels = torch.randint(1, self.num_classes, (1, num_objects), dtype=torch.int64)
iscrowd = torch.zeros(len(labels), dtype=torch.int64)
target["boxes"] = boxes
target["labels"] = labels
target["area"] = areas
target["iscrowd"] = iscrowd
target["masks"] = masks
return image, target, image_id
class BalancedObjectsSampler(BatchSampler):
"""Samples either batch_size images or batches num_objs_per_batch objects.
Args:
data_source (list): contains tuples of (img_id).
batch_size (int): batch size.
num_objs_per_batch (int): number of objects in a batch.
Return
yields the batch_ids/image_ids/image_indices
"""
def __init__(self, data_source, batch_size, num_objs_per_batch, drop_last=False):
self.data_source = data_source
self.sampler = data_source
self.batch_size = batch_size
self.drop_last = drop_last
self.num_objs_per_batch = num_objs_per_batch
self.batch_count = math.ceil(len(self.data_source) / self.batch_size)
def __iter__(self):
obj_count = 0
batch = []
batches = []
counter = 0
for i, (k, s) in enumerate(self.data_source.iteritems()):
if (
obj_count <= obj_count + s
and len(batch) <= self.batch_size - 1
and obj_count + s <= self.num_objs_per_batch
and i < len(self.data_source) - 1
):
# because of https://pytorch.org/docs/stable/data.html#data-loading-order-and-sampler
batch.append(i)
obj_count += s
else:
batches.append(batch)
yield batch
obj_count = 0
batch = []
counter += 1
obj_sums = {}
batch_size = 10
workers = 4
fake_image_ids = np.random.randint(1600000, 1700000, 100)
# assigning any in-range number objects count to each image
for i, k in enumerate(fake_image_ids):
obj_sums[k] = random.randint(10, 1200)
obj_counts = pd.Series(obj_sums)
train_dataset = SyntheticDataset(image_ids=fake_image_ids)
balanced_sampler = BalancedObjectsSampler(
data_source=obj_counts,
batch_size=batch_size,
num_objs_per_batch=1500,
drop_last=False,
)
data_loader_sampler = torch.utils.data.DataLoader(
train_dataset,
num_workers=workers,
collate_fn=collate_fn,
sampler=balanced_sampler,
)
data_loader_iter = torch.utils.data.DataLoader(
train_dataset,
batch_size=batch_size,
shuffle=False,
num_workers=workers,
collate_fn=collate_fn,
)
Iterating over the balanced_sampler with
for i, bal_batch in enumerate(balanced_sampler):
    print(f"batch_{i}: ", bal_batch)
yields
batch_0: [0]
batch_1: [2, 3]
batch_2: [5]
batch_3: [7]
batch_4: [9, 10]
batch_5: [12, 13, 14, 15]
batch_6: [17, 18]
batch_7: [20, 21, 22]
batch_8: [24, 25]
batch_9: [27]
batch_10: [29]
batch_11: [31]
batch_12: [33]
batch_13: [35, 36, 37]
batch_14: [39, 40]
batch_15: [42, 43]
batch_16: [45, 46]
batch_17: [48, 49, 50]
batch_18: [52, 53, 54]
batch_19: [56]
batch_20: [58, 59]
batch_21: [61, 62]
batch_22: [64]
batch_23: [66]
batch_24: [68]
batch_25: [70, 71]
batch_26: [73]
batch_27: [75, 76, 77]
batch_28: [79, 80]
batch_29: [82, 83, 84, 85, 86, 87]
batch_30: [89]
batch_31: [91]
batch_32: [93, 94]
batch_33: [96]
batch_34: [98]
The values displayed above are image indices, but they could also be batch indices or even image ids.
Running
for i, batch in enumerate(data_loader_sampler):
    print("__sample__: ", i, len(batch[0]))
shows that each batch contains a single sample instead of the expected number:
__sample__: 0 1
__sample__: 1 1
__sample__: 2 1
__sample__: 3 1
__sample__: 4 1
__sample__: 5 1
__sample__: 6 1
__sample__: 7 1
__sample__: 8 1
__sample__: 9 1
__sample__: 10 1
__sample__: 11 1
__sample__: 12 1
__sample__: 13 1
__sample__: 14 1
__sample__: 15 1
__sample__: 16 1
__sample__: 17 1
__sample__: 18 1
__sample__: 19 1
__sample__: 20 1
__sample__: 21 1
__sample__: 22 1
__sample__: 23 1
__sample__: 24 1
__sample__: 25 1
__sample__: 26 1
__sample__: 27 1
__sample__: 28 1
__sample__: 29 1
__sample__: 30 1
__sample__: 31 1
__sample__: 32 1
__sample__: 33 1
__sample__: 34 1
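A likely reason for the single-sample batches: the object passed as sampler= is expected to yield one index at a time, and the DataLoader then wraps it with its default batch_size=1, so each yielded list is treated as a single "index". Lists of indices are meant to go through the batch_sampler argument instead. Here is a toy sketch, with IndexDataset and the hard-coded index lists as hypothetical stand-ins:

```python
from torch.utils.data import DataLoader, Dataset


class IndexDataset(Dataset):
    """Toy dataset (hypothetical) whose samples are just their own indices."""

    def __init__(self, n):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        return idx


def keep_as_list(batch):
    # Skip tensor stacking so the batch stays a plain list of samples.
    return list(batch)


# Any iterable of index lists can serve as a batch sampler.
index_batches = [[0, 1], [2], [3, 4, 5]]
loader = DataLoader(IndexDataset(6), batch_sampler=index_batches, collate_fn=keep_as_list)
batch_sizes = [len(batch) for batch in loader]
print(batch_sizes)
```

With batch_sampler, the dataset's __getitem__ is still called once per index, so the dataset itself does not need to accept lists.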
What I am really trying to prevent is the following behavior, which arises from
for i, batch in enumerate(data_loader_iter):
    print("__iter__: ", i, sum([k["masks"].shape[0] for k in batch[1]]))
which is
__iter__: 0 2510
__iter__: 1 2060
__iter__: 2 2203
__iter__: 3 2815
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
Traceback (most recent call last):
File "/usr/lib/python3.8/multiprocessing/queues.py", line 239, in _feed
obj = _ForkingPickler.dumps(obj)
File "/usr/lib/python3.8/multiprocessing/reduction.py", line 51, in dumps
cls(buf, protocol).dump(obj)
File "/blip/venv/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 328, in reduce_storage
fd, size = storage._share_fd_()
RuntimeError: falseINTERNAL ASSERT FAILED at "../aten/src/ATen/MapAllocator.cpp":300, please report a bug to PyTorch. unable to write to file </torch_431207_56>
Traceback (most recent call last):
File "/blip/venv/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 990, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "/usr/lib/python3.8/multiprocessing/queues.py", line 107, in get
if not self._poll(timeout):
File "/usr/lib/python3.8/multiprocessing/connection.py", line 257, in poll
return self._poll(timeout)
File "/usr/lib/python3.8/multiprocessing/connection.py", line 424, in _poll
r = wait([self], timeout)
File "/usr/lib/python3.8/multiprocessing/connection.py", line 931, in wait
ready = selector.select(timeout)
File "/usr/lib/python3.8/selectors.py", line 415, in select
fd_event_list = self._selector.poll(timeout)
File "/blip/venv/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
_error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 431257) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "so.py", line 170, in <module>
for i, batch in enumerate(data_loader_iter):
File "/blip/venv/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 521, in __next__
data = self._next_data()
File "/blip/venv/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1186, in _next_data
idx, data = self._get_data()
File "/blip/venv/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1152, in _get_data
success, data = self._try_get_data()
File "/blip/venv/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1003, in _try_get_data
raise RuntimeError('DataLoader worker (pid(s) {}) exited unexpectedly'.format(pids_str)) from e
RuntimeError: DataLoader worker (pid(s) 431257) exited unexpectedly
which invariably happens when the number of objects per batch is greater than ~2500.
An immediate workaround would be to lower batch_size, but I need a more robust solution.
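The ~2500 threshold is consistent with a rough size estimate of the masks alone. torch.randint returns int64 by default, so each H x W mask takes 8 bytes per pixel. The sketch below ignores images, boxes, and any copies made while passing batches between processes, so it is a lower bound rather than a measurement; mask_bytes is a hypothetical helper.

```python
H, W = 1000, 700          # same values as in the example above
BYTES_PER_INT64 = 8       # torch.randint defaults to int64


def mask_bytes(num_objects, height=H, width=W, itemsize=BYTES_PER_INT64):
    """Bytes needed for a (num_objects, height, width) mask tensor."""
    return num_objects * height * width * itemsize


for n in (10, 1200, 2500):
    print(f"{n:>5} objects -> {mask_bytes(n) / 2**30:.2f} GiB of masks")
```

For 2500 objects this comes to 14,000,000,000 bytes, roughly 13 GiB, which can easily exceed the shared memory available to the worker processes.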
If what you are trying to solve really is:
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
You could try resizing the allocated shared memory with
# mount -o remount,size=<whatever_is_enough>G /dev/shm
However, as this is not always possible, one fix to your problem would be
import gc

class SyntheticDataset(Dataset):
    def __init__(self, image_ids):
        self.image_ids = torch.tensor(image_ids, dtype=torch.int64)
        self.num_classes = 9

    def __len__(self):
        return len(self.image_ids)

    def __getitem__(self, indices):
        # Receives the whole list yielded by the sampler and returns a list of samples.
        worker_info = torch.utils.data.get_worker_info()
        batch = []
        for i in indices:
            sample = self.get_sample(i)
            batch.append(sample)
        gc.collect()
        return batch

    def get_sample(self, idx: int):
        image_id = torch.as_tensor(idx)
        image = torch.randint(0, 255, (H, W))
        # In this synthetic example the value passed in doubles as the object count.
        num_objects = idx
        image = torch.randint(0, 255, (3, H, W))
        masks = torch.randint(0, 255, (num_objects, H, W))
        target = {}
        target["image_id"] = image_id
        areas = torch.randint(100, 20000, (1, num_objects), dtype=torch.int64)
        boxes = torch.randint(100, H * W, (num_objects, 4), dtype=torch.int64)
        labels = torch.randint(1, self.num_classes, (1, num_objects), dtype=torch.int64)
        iscrowd = torch.zeros(len(labels), dtype=torch.int64)
        target["boxes"] = boxes
        target["labels"] = labels
        target["area"] = areas
        target["iscrowd"] = iscrowd
        target["masks"] = masks
        return image, target, image_id
and
class BalancedObjectsSampler(BatchSampler):
    """Samples either batch_size images or batches num_objs_per_batch objects.
    Args:
        data_source (list): contains tuples of (img_id).
        batch_size (int): batch size.
        num_objs_per_batch (int): number of objects in a batch.
    Return
        yields the batch_ids/image_ids/image_indices
    """

    def __init__(self, data_source, batch_size, num_objs_per_batch, drop_last=False):
        self.data_source = data_source
        self.sampler = data_source
        self.batch_size = batch_size
        self.drop_last = drop_last
        self.num_objs_per_batch = num_objs_per_batch
        self.batch_count = math.ceil(len(self.data_source) / self.batch_size)
        # Precompute how many images go into each batch.
        obj_count = 0
        batch = []
        batches = []
        batches_sums = []
        # Series.items() replaces iteritems(), which newer pandas versions no longer provide.
        for i, (k, s) in enumerate(self.data_source.items()):
            if (
                len(batch) < self.batch_size
                and obj_count + s < self.num_objs_per_batch
                and i < len(self.data_source) - 1
            ):
                batch.append(s)
                obj_count += s
            else:
                batches.append(len(batch))
                batches_sums.append(obj_count)
                obj_count = 0
                batch = []
        self.batches = batches
        self.batch_count = len(batches)

    def __iter__(self):
        batch = []
        img_counts_id = 0
        for idx, (k, s) in enumerate(self.data_source.items()):
            if len(batch) < self.batches[img_counts_id] and idx < len(self.data_source):
                # Yields the object count s, which get_sample uses as num_objects.
                batch.append(s)
            elif len(batch) == self.batches[img_counts_id]:
                gc.collect()
                yield batch
                batch = []
                if img_counts_id < self.batch_count - 1:
                    img_counts_id += 1
                else:
                    break
        if len(batch) > 0 and not self.drop_last:
            yield batch

    def __len__(self) -> int:
        if self.drop_last:
            return len(self.data_source) // self.batch_size
        else:
            return (len(self.data_source) + self.batch_size - 1) // self.batch_size
Since SyntheticDataset's __getitem__ was receiving a list of indices, the simplest solution is to iterate over those indices and return a list of samples. You may have to collate the output differently in order to feed it to your model.
For the BalancedObjectsSampler, I calculate the size of each batch in __init__ and use it in __iter__ to assemble the batches.
NOTE: This will still fail with num_workers > 0, because you are trying to pack up to 1500 objects into a batch and each worker usually loads one batch at a time. Hence, you have to reassess num_objs_per_batch when using multiprocessing.
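The batching idea described above (cap both the number of images and the number of objects per batch) can also be written as a single greedy pass over the object counts. This is a simplified sketch, not the sampler from this answer: group_by_object_budget is a hypothetical name, it returns image indices rather than object counts, and it keeps every image, including one that exceeds the budget on its own.

```python
def group_by_object_budget(object_counts, batch_size, max_objects):
    """Greedily group sample indices so that each batch holds at most
    batch_size samples and at most max_objects objects in total."""
    batches, batch, total = [], [], 0
    for idx, count in enumerate(object_counts):
        # Close the current batch if adding this sample would break a limit.
        if batch and (len(batch) == batch_size or total + count > max_objects):
            batches.append(batch)
            batch, total = [], 0
        batch.append(idx)
        total += count
    if batch:
        batches.append(batch)
    return batches


counts = [900, 400, 300, 1200, 50, 60, 70]
grouped = group_by_object_budget(counts, batch_size=3, max_objects=1500)
print(grouped)
```

The resulting list of index lists can be handed to a DataLoader through its batch_sampler argument.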
This code for my custom data loader runs smoothly with batch_size=1, but when I increase the batch size I get the following error:
RuntimeError: Expected object of scalar type Double but got scalar type Long for sequence element 1 in sequence argument at position #1 'tensors'
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
matplotlib.use("TkAgg")
import os, h5py
import PIL
#------------------------------
import torch
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms
#------------------------------
from data_augmentation import *
#------------------------------
dtype = torch.cuda.FloatTensor if torch.cuda.is_available() else torch.FloatTensor
class NiftiDataset(Dataset):
def __init__(self,transformation_params,data_path, mode='train',transforms=None ):
"""
Parameters:
data_path (string): Root directory of the preprocessed dataset.
mode (string, optional): Select the image_set to use, ``train``, ``valid``
transforms (callable, optional): Optional transform to be applied
on a sample.
"""
self.data_path = data_path
self.mode = mode
self.images = []
self.labels = []
self.W_maps = []
self.centers = []
self.radiuss = []
self.pixel_spacings = []
self.transformation_params = transformation_params
self.transforms = transforms
#-------------------------------------------------------------------------------------
if self.mode == 'train':
self.data_path = os.path.join(self.data_path,'train_set')
elif self.mode == 'valid':
self.data_path = os.path.join(self.data_path,'validation_set')
#-------------------------------------------------------------------------------------
for _, _, f in os.walk(self.data_path):
for file in f:
hdf_file = os.path.join(self.data_path,file)
data = h5py.File(hdf_file,'r') # Dictionary
# Preprocessing of Input Image and Label
patch_img, patch_gt, patch_wmap = PreProcessData(file, data, self.mode, self.transformation_params)
#print(type(data))
self.images.append(patch_img) # 2D image
#print('image shape is : ',patch_img.shape)
self.labels.append(patch_gt) # 2D label
#print('label shape is : ',patch_img.shape)
self.W_maps.append(patch_wmap) # Weight_Map
# self.centers.append(data['roi_center'][:]) # [x,y]
# self.radiuss.append(data['roi_radii'][:]) # [R_min,R_max]
# self.pixel_spacings.append(data['pixel_spacing'][:]) # [x , y , z]
def __len__(self):
return len(self.images)
def __getitem__(self, index):
image = self.images[index]
label = self.labels[index]
W_map = self.W_maps[index]
if self.transforms is not None:
image, label, W_map = self.transforms(image, label, W_map)
return image, label, W_map
#=================================================================================================
if __name__ == '__main__':
# Test routine to check your threaded dataloader
# ACDC dataset has 4 labels
n_labels = 4
path = './hdf5_files'
batch_size = 1
# Data Augmentation Parameters
# Set patch extraction parameters
size1 = (128, 128)
patch_size = size1
mm_patch_size = size1
max_size = size1
train_transformation_params = {
'patch_size': patch_size,
'mm_patch_size': mm_patch_size,
'add_noise': ['gauss', 'none1', 'none2'],
'rotation_range': (-5, 5),
'translation_range_x': (-5, 5),
'translation_range_y': (-5, 5),
'zoom_range': (0.8, 1.2),
'do_flip': (False, False),
}
valid_transformation_params = {
'patch_size': patch_size,
'mm_patch_size': mm_patch_size}
transformation_params = { 'train': train_transformation_params,
'valid': valid_transformation_params,
'n_labels': 4,
'data_augmentation': True,
'full_image': False,
'data_deformation': False,
'data_crop_pad': max_size}
#====================================================================
dataset = NiftiDataset(transformation_params=transformation_params,data_path=path,mode='train')
dataloader = DataLoader(dataset=dataset,batch_size=2,shuffle=True,num_workers=0)
dataiter = iter(dataloader)
data = dataiter.next()
images, labels,W_map = data
#===============================================================================
# Data Visualization
#===============================================================================
print('image: ',images.shape,images.type(),'label: ',labels.shape,labels.type(),
'W_map: ',W_map.shape,W_map.type())
img = transforms.ToPILImage()(images[0,0,:,:,0].float())
lbl = transforms.ToPILImage()(labels[0,0,:,:].float())
W_mp = transforms.ToPILImage()(W_map [0,0,:,:].float())
plt.subplot(1,3,1)
plt.imshow(img,cmap='gray',interpolation=None)
plt.title('image')
plt.subplot(1,3,2)
plt.imshow(lbl,cmap='gray',interpolation=None)
plt.title('label')
plt.subplot(1,3,3)
plt.imshow(W_mp,cmap='gray',interpolation=None)
plt.title('Weight Map')
plt.show()
I have noticed some strange things, such as the tensor types being different even though the images, labels, and weight maps are all images of the same type and size.
The Error Traceback:
Traceback (most recent call last):
File "D:\Saudi_CV\Vibot\Smester_2\2_Medical Image analysis\Project_2020\OUR_Project\data_loader.py", line 118, in <module>
data = dataiter.next()
File "F:\Download_2019\Anaconda3\lib\site-packages\torch\utils\data\dataloader.py", line 345, in __next__
data = self._next_data()
File "F:\Download_2019\Anaconda3\lib\site-packages\torch\utils\data\dataloader.py", line 385, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
File "F:\Download_2019\Anaconda3\lib\site-packages\torch\utils\data\_utils\fetch.py", line 47, in fetch
return self.collate_fn(data)
File "F:\Download_2019\Anaconda3\lib\site-packages\torch\utils\data\_utils\collate.py", line 79, in default_collate
return [default_collate(samples) for samples in transposed]
File "F:\Download_2019\Anaconda3\lib\site-packages\torch\utils\data\_utils\collate.py", line 79, in <listcomp>
return [default_collate(samples) for samples in transposed]
File "F:\Download_2019\Anaconda3\lib\site-packages\torch\utils\data\_utils\collate.py", line 64, in default_collate
return default_collate([torch.as_tensor(b) for b in batch])
File "F:\Download_2019\Anaconda3\lib\site-packages\torch\utils\data\_utils\collate.py", line 55, in default_collate
return torch.stack(batch, 0, out=out)
RuntimeError: Expected object of scalar type Double but got scalar type Long for sequence element 1 in sequence argument at position #1 'tensors'
[Finished in 19.9s with exit code 1]
The problem was solved with the solution explained on the linked page, which casts every array to the same tensor type:
image = torch.from_numpy(self.images[index]).type(torch.FloatTensor)
label = torch.from_numpy(self.labels[index]).type(torch.FloatTensor)
W_map = torch.from_numpy(self.W_maps[index]).type(torch.FloatTensor)
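Before casting, it can help to confirm which arrays actually differ in dtype, since the traceback above shows torch.stack inside default_collate failing on a Double/Long mix. Here is a small NumPy sketch; dtype_counts and to_float32 are hypothetical helper names, and the two arrays stand in for the preprocessed patches.

```python
from collections import Counter

import numpy as np


def dtype_counts(arrays):
    """Count how many arrays there are of each dtype."""
    return Counter(str(np.asarray(a).dtype) for a in arrays)


def to_float32(arrays):
    """Return copies of the arrays with one common dtype."""
    return [np.asarray(a, dtype=np.float32) for a in arrays]


# Stand-ins for two patches that came out of preprocessing with different dtypes.
samples = [np.zeros((4, 4), dtype=np.float64), np.zeros((4, 4), dtype=np.int64)]
print(dtype_counts(samples))
print(dtype_counts(to_float32(samples)))
```

Running dtype_counts over self.images, self.labels, and self.W_maps separately would show which list mixes types.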
I wrote a generator for Keras that uses PyTables to read images from an HDF5 file (see code below).
It works fine when called like so:
self._model.fit_generator(self.training_generator,
epochs=epochs,
validation_data=self.validation_generator,
verbose=1,
callbacks=[model_checkpoint, tensorboard_callback],
use_multiprocessing=True,
# workers=2 # uncommenting this and using more than 1 worker fails
)
However, if I use multiple workers (see the commented line above), I get the error shown below. I suspect that this is related to multiple threads attempting to access the HDF5 file. However, I thought that PyTables and HDF5 could handle this for read-only access. So what am I doing wrong?
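Because use_multiprocessing=True runs the workers as separate processes, a handle opened in __init__ ends up shared by every worker. One commonly suggested pattern is to open the file on first use in each process instead. Whether that resolves this particular error is an assumption, not something confirmed here. The sketch below uses a hypothetical PerProcessHandle wrapper and a fake opener; in the generator, the opener would be the existing tables.open_file(self.pytable_file_path, 'r') call.

```python
import os


class PerProcessHandle:
    """Open a resource lazily and reopen it if the process id changes."""

    def __init__(self, opener):
        self._opener = opener
        self._handle = None
        self._pid = None

    def get(self):
        pid = os.getpid()
        # A forked worker inherits _handle, but its pid differs, so it reopens.
        if self._handle is None or self._pid != pid:
            self._handle = self._opener()
            self._pid = pid
        return self._handle


open_calls = []


def fake_open():
    # Stand-in for opening the HDF5 file; records which process opened it.
    open_calls.append(os.getpid())
    return object()


handle = PerProcessHandle(fake_open)
first = handle.get()
second = handle.get()
print(len(open_calls))
```

Within one process the opener runs only once, so repeated __getitem__ calls would not reopen the file.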
Bonus question: Will this code make sure that, during training, the model sees a given sample only once per epoch, as mentioned here under Notes?
Sequence are a safer way to do multiprocessing. This structure
guarantees that the network will only train once on each sample per
epoch which is not the case with generators.
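Whether each sample is used at most once per epoch can be checked from the index arithmetic alone. The sketch below mirrors how the generator's __len__ and __getitem__ slice a shuffled index list; epoch_batches is a hypothetical stand-in, not part of Keras.

```python
import random


def epoch_batches(number_of_samples, batch_size, seed=0):
    """Shuffle all indexes once, then slice them into
    floor(number_of_samples / batch_size) consecutive batches."""
    indexes = list(range(number_of_samples))
    random.Random(seed).shuffle(indexes)
    n_batches = number_of_samples // batch_size
    return [indexes[b * batch_size:(b + 1) * batch_size] for b in range(n_batches)]


batches = epoch_batches(number_of_samples=10, batch_size=4)
seen = [i for batch in batches for i in batch]
print(len(seen), len(set(seen)))
```

No index repeats within an epoch, but with 10 samples and a batch size of 4, two samples are left out because the batch count is rounded down.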
This is the error that I get when using more than one worker:
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/lib/python3.7/multiprocessing/pool.py", line 121, in worker
result = (True, func(*args, **kwds))
File "/project/path/venv/lib/python3.7/site-packages/keras/utils/data_utils.py", line 401, in get_index
return _SHARED_SEQUENCES[uid][i]
File "/project/path/python_package/python_package/training_generators.py", line 41, in __getitem__
images, masks, weights = self.__data_generation(indexes)
File "/project/path/python_package/python_package/training_generators.py", line 52, in __data_generation
images, labels = self.__get_images(indexes)
File "/project/path/python_package/python_package/training_generators.py", line 79, in __get_images
labels[counter] = self.tables.root['labels'][i, ...]
File "/project/path/venv/lib/python3.7/site-packages/tables/array.py", line 662, in __getitem__
arr = self._read_slice(startl, stopl, stepl, shape)
File "/project/path/venv/lib/python3.7/site-packages/tables/array.py", line 766, in _read_slice
self._g_read_slice(startl, stopl, stepl, nparr)
File "tables/hdf5extension.pyx", line 1585, in tables.hdf5extension.Array._g_read_slice
tables.exceptions.HDF5ExtError: HDF5 error back trace
File "H5Dio.c", line 216, in H5Dread
can't read data
File "H5Dio.c", line 587, in H5D__read
can't read data
File "H5Dchunk.c", line 2276, in H5D__chunk_read
error looking up chunk address
File "H5Dchunk.c", line 3022, in H5D__chunk_lookup
can't query chunk address
File "H5Dbtree.c", line 1047, in H5D__btree_idx_get_addr
can't get chunk info
File "H5B.c", line 341, in H5B_find
unable to load B-tree node
File "H5AC.c", line 1763, in H5AC_protect
H5C_protect() failed
File "H5C.c", line 2565, in H5C_protect
can't load entry
File "H5C.c", line 6890, in H5C_load_entry
Can't deserialize image
File "H5Bcache.c", line 181, in H5B__cache_deserialize
wrong B-tree signature
End of HDF5 error back trace
Problems reading the array data.
"""
This is the code of my generator:
import keras
import numpy as np
import scipy.ndimage
import tables

# ImageProcessor and data_augmentation are defined elsewhere in my package (not shown).


class DataGenerator(keras.utils.Sequence):
    'Generates data for Keras'

    def __init__(self, pytables_file_path=None, batch_size=32, shuffle=True, image_processor: ImageProcessor = None,
                 augment_params=None, image_type=None):
        'Initialization'
        self.batch_size = batch_size
        self.image_type = image_type
        self.pytable_file_path = pytables_file_path
        self.tables = tables.open_file(self.pytable_file_path, 'r')
        self.number_of_samples = self.tables.root[self.image_type].shape[0]
        self.image_size = self.tables.root[self.image_type].shape[1:]
        self.indexes = list(range(self.number_of_samples))
        self.shuffle = shuffle
        self.image_processor = image_processor
        self.on_epoch_end()
        self.augment_params = augment_params

    def __del__(self):
        self.tables.close()

    def __len__(self):
        'Denotes the number of batches per epoch'
        return int(np.floor(self.number_of_samples / self.batch_size))

    def __getitem__(self, index):
        'Generate one batch of data'
        # Generate indexes of the batch
        indexes = self.indexes[index * self.batch_size:(index + 1) * self.batch_size]
        # Generate data
        images, masks, weights = self.__data_generation(indexes)
        # Stack the weights behind the masks as an extra last channel
        mask_wei_arr = np.concatenate((masks, weights[:, :, :, np.newaxis]), axis=-1)
        return (images, mask_wei_arr)

    def on_epoch_end(self):
        """Run after each epoch."""
        if self.shuffle:
            np.random.shuffle(self.indexes)  # Shuffle indexes after each epoch

    def __data_generation(self, indexes):
        'Generates data containing batch_size samples'  # X : (n_samples, *dim, n_channels)
        images, labels = self.__get_images(indexes)
        if self.image_processor:
            images = self.__process_images(images)
        masks, weights = self.generate_masks_and_weights_from_labels(labels)
        if self.augment_params:
            [images, masks, weights] = self.augment_data(images, masks, weights)
        images = images.astype('float32')
        masks_new = masks.astype('float32')
        weights_new = weights.astype('float32')
        weights_new = weights_new[:, :, :, 0]
        return images, masks_new, weights_new

    def __process_images(self, images):
        for ind, image in enumerate(images):
            images[ind, ...] = self.image_processor.process(image)
        return images

    def __get_images(self, indexes):
        images = np.empty((self.batch_size, *self.image_size))
        labels = np.empty((self.batch_size, *self.image_size))
        for counter, i in enumerate(indexes):
            current_image = self.tables.root[self.image_type][i, ...]
            images[counter] = current_image
            labels[counter] = self.tables.root['labels'][i, ...]
        return images, labels

    def generate_masks_and_weights_from_labels(self, labels):
        max_lbl_val = int(np.max(labels))
        edges = np.zeros_like(labels).astype(bool)
        masks = np.asarray(labels > 0).astype(float)
        weights = np.ones_like(labels)
        se_size = 3  # use '3': to get 1 pixel dilation; use '5': to get 2 pixel dilation
        structure = np.ones((1, se_size, se_size, 1))
        for lbl_ind in range(1, max_lbl_val + 1):  # iterate over labels
            label_mask = labels == lbl_ind
            label_dilated_edges = scipy.ndimage.morphology.binary_dilation(label_mask, structure) & ~label_mask
            label_eroded_edges = ~scipy.ndimage.morphology.binary_erosion(label_mask, structure) & label_mask
            label_edges = np.bitwise_or(label_eroded_edges, label_dilated_edges)
            edges = np.bitwise_or(edges, label_edges)
        weights[edges] *= 10  # weight the edges more by factor 10
        return masks, weights

    def augment_data(self, images, masks, weights):
        # for index, _ in enumerate(images):
        #     [images[index, :, :, 0], masks[index, :, :, 0], weights[index, :, :, 0]] = data_augmentation(
        #         [images[index, :, :, 0], masks[index, :, :, 0], weights[index, :, :, 0]], self.augment_params,
        #         order=[1, 0, 0])
        for index, image in enumerate(images):
            image = images[index, ...]
            mask = masks[index, ...]
            weight = weights[index, ...]
            [image, mask, weight] = data_augmentation([image, mask, weight], self.augment_params, order=[1, 0, 0])
            # fix, ax = plt.subplots(1, 3, figsize=(5, 15))
            # ax[0].imshow(image[:, :, 0])
            # ax[1].imshow(mask[:, :, 0])
            # ax[2].imshow(weight[:, :, 0])
            # plt.show()
            images[index, ...] = image
            masks[index, ...] = mask
            weights[index, ...] = weight
        return images, masks, weights
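One possible cause, which nothing in the post confirms: the traceback comes from a multiprocessing pool worker, while the PyTables file is opened once in __init__, so forked workers may end up sharing a single HDF5 handle. A common workaround for that situation is to open the file lazily, once per process. The sketch below shows only the pattern, using the standard library: `PerProcessHandle` and its `opener` argument are hypothetical names, and the built-in `open` on a temporary text file stands in for `tables.open_file` on the real HDF5 file.

```python
import os
import tempfile


class PerProcessHandle:
    """Open a resource lazily, once per process (hypothetical helper)."""

    def __init__(self, opener):
        self._opener = opener    # zero-argument callable returning an open handle
        self._handle = None
        self._owner_pid = None   # pid of the process that opened the handle

    def get(self):
        pid = os.getpid()
        if self._handle is None or self._owner_pid != pid:
            # First use in this process (or this object was copied in by fork):
            # open a fresh handle instead of reusing the inherited one.
            self._handle = self._opener()
            self._owner_pid = pid
        return self._handle

    def close(self):
        # Only close a handle that this process opened itself.
        if self._handle is not None and self._owner_pid == os.getpid():
            self._handle.close()
        self._handle = None
        self._owner_pid = None


# Stand-in demo: a plain temporary file plays the role of the HDF5 file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("labels")
    demo_path = tmp.name

handle = PerProcessHandle(lambda: open(demo_path, "r"))
first = handle.get()
second = handle.get()    # same process, so the same handle is reused
content = first.read()
handle.close()
os.remove(demo_path)
```

Applied to the generator above, this would mean keeping only the file path in __init__ and opening the file the first time __getitem__ runs in each worker. Whether that removes the "wrong B-tree signature" error depends on the setup, so treat it as something to try rather than a confirmed fix.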
I've written a custom generator using Keras Sequence, but at the end of the first epoch I got:
Attribute Error: Custom Generator object has no attribute 'shape'
Ubuntu 18.04
Cuda 10
Tried Tensorflow 1.13 & 1.14
After seeing this page:
https://github.com/keras-team/keras/issues/12586
I tried changing
from keras.utils import Sequence
to
from tensorflow.python.keras.utils.data_utils import Sequence
but no luck!
class CustomGenerator(Sequence):
    def __init__(self, ....):
        ...
        # Preallocate memory
        if mode == 'train' and self.crop_shape:
            self.X = np.zeros((batch_size, crop_shape[0], crop_shape[1], 4), dtype='float32')
            # edge
            # self.X2 = np.zeros((batch_size, crop_shape[1], crop_shape[0], 3), dtype='float32')
            self.Y1 = np.zeros((batch_size, crop_shape[0] // 4, crop_shape[1] // 4, self.n_classes), dtype='float32')

    def on_epoch_end(self):
        # Shuffle dataset for next epoch
        c = list(zip(self.image_path_list, self.label_path_list, self.edge_path_list))
        random.shuffle(c)
        self.image_path_list, self.label_path_list, self.edge_path_list = zip(*c)
        # Fix memory leak (tensorflow.python.keras bug)
        gc.collect()

    def __getitem__(self, index):
        for n, (image_path, label_path, edge_path) in enumerate(
                zip(self.image_path_list[index * self.batch_size:(index + 1) * self.batch_size],
                    self.label_path_list[index * self.batch_size:(index + 1) * self.batch_size],
                    self.edge_path_list[index * self.batch_size:(index + 1) * self.batch_size])):
            image = cv2.imread(image_path, 1)
            label = cv2.imread(label_path, 0)
            edge = cv2.imread(edge_path, 0)
            ....
            self.X[n] = image
            self.Y1[n] = to_categorical(cv2.resize(label, (label.shape[1] // 4, label.shape[0] // 4)),
                                        self.n_classes).reshape((label.shape[0] // 4, label.shape[1] // 4, -1))
            self.Y2[n] = to_categorical(cv2.resize(label, (label.shape[1] // 8, label.shape[0] // 8)),
                                        self.n_classes).reshape((label.shape[0] // 8, label.shape[1] // 8, -1))
            self.Y3[n] = to_categorical(cv2.resize(label, (label.shape[1] // 16, label.shape[0] // 16)),
                                        self.n_classes).reshape((label.shape[0] // 16, label.shape[1] // 16, -1))
        return self.X, [self.Y1, self.Y2, self.Y3]

    def __len__(self):
        return math.floor(len(self.image_path_list) / self.batch_size)


def random_crop(image, edge, label, random_crop_size=(800, 1600)):
    ....
    return image, label
The error is:
742/743 [============================>.] - ETA: 0s - loss: 1.8465 - conv6_cls_loss: 1.1261 - sub24_out_loss: 1.2478 - sub4_out_loss: 1.3827 - conv6_cls_categorical_accuracy: 0.6705 - sub24_out_categorical_accuracy: 0.6250 - sub4_out_categorical_accuracy: 0.5963
Traceback (most recent call last):
File "/home/user/Desktop/Keras-ICNet/train1.py", line 75, in <module>
use_multiprocessing=True, shuffle=True, max_queue_size=10, initial_epoch=opt.epoch)
File "/home/user/.local/lib/python3.6/site-packages/tensorflow/python/keras/engine/training.py", line 1433, in fit_generator
steps_name='steps_per_epoch')
File "/home/user/.local/lib/python3.6/site-packages/tensorflow/python/keras/engine/training_generator.py", line 322, in model_iteration
steps_name='validation_steps')
File "/home/user/.local/lib/python3.6/site-packages/tensorflow/python/keras/engine/training_generator.py", line 144, in model_iteration
shuffle=shuffle)
File "/home/user/.local/lib/python3.6/site-packages/tensorflow/python/keras/engine/training_generator.py", line 480, in convert_to_generator_like
num_samples = int(nest.flatten(data)[0].shape[0])
AttributeError: 'int' object has no attribute 'shape'
Looking at the stack trace,
num_samples = int(nest.flatten(data)[0].shape[0])
AttributeError: 'int' object has no attribute 'shape'
Here, data refers to the validation_data argument passed to fit_generator, which is supposed to be a generator (or Sequence) or a tuple of arrays. My guess is that it was passed in some other form (for example a list or tuple whose first element is an int), so nest.flatten(data)[0] returns an int, which has no shape attribute, hence the error.
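To test that guess, it may help to inspect what is actually passed as validation_data before calling fit_generator. The sketch below is a pure-Python stand-in for such a check, not TensorFlow's own code: `flatten` is a simplified substitute for nest.flatten that only walks lists and tuples, `first_leaf_has_shape` is a hypothetical helper name, and `FakeArray` replaces a NumPy array so the example runs without extra packages. The shapes are made up for illustration.

```python
def flatten(data):
    """Simplified stand-in for nest.flatten: walks lists and tuples only."""
    if isinstance(data, (list, tuple)):
        leaves = []
        for item in data:
            leaves.extend(flatten(item))
        return leaves
    return [data]


def first_leaf_has_shape(data):
    """Return True if the first flattened leaf exposes a .shape attribute."""
    leaves = flatten(data)
    return bool(leaves) and hasattr(leaves[0], "shape")


class FakeArray:
    """Minimal array-like object, used here instead of a NumPy array."""

    def __init__(self, shape):
        self.shape = shape


# A tuple of array-likes passes the check.
good = (FakeArray((8, 64, 64, 4)), [FakeArray((8, 16, 16, 19))])
# An int in first position reproduces the situation from the stack trace.
bad = (10, FakeArray((8, 64, 64, 4)))

good_ok = first_leaf_has_shape(good)
bad_ok = first_leaf_has_shape(bad)
```

If the same check fails on the real validation_data, the problem most likely lies in how that argument is built rather than in the batches the training generator returns.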