I am trying to train a Keras model on a very large number of images and labels. I want to use model.fit_generator and somehow flow the input images and labels from memory, because we prepare all the training data in memory after each image is loaded. The problem is that we have plenty of large images that we then clip into smaller patches and feed to the model, so we need a for loop inside a while loop.
Something like this:
while True:
    for file in files:  # let's say there are 500 files (images)
        image = ReadImage(file)
        X = prepareImage(image)  # here it is cut and prepared into the required shape
        Y = labels
        for batch_start in range(0, len(X), batch_size):  # batch_size defined elsewhere
            batch_end = batch_start + batch_size
            yield X[batch_start:batch_end], Y[batch_start:batch_end]
After it yields the last batch of the first image, we need to load the next image in the for loop, prepare the data, and yield again within the same epoch. For the second epoch we need all the images again. The problem is that we prepare everything in memory: from one image we create millions of training samples and then move on to the next image. We cannot write all the data to disk and use flow_from_directory, since that would require far too much disk space.
Any hint?
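A minimal self-contained sketch of such a generator; the callables read_image, prepare_image, and get_labels are stand-ins for the actual ReadImage/prepareImage/label logic, not part of any library:

```python
import numpy as np

def patch_generator(files, batch_size, read_image, prepare_image, get_labels):
    """Yield (X, Y) batches, one source image at a time, looping forever
    so that each full pass over `files` corresponds to one epoch."""
    while True:
        for file in files:
            image = read_image(file)          # load one large image
            X = prepare_image(image)          # clip into patches, e.g. (num_patches, h, w)
            Y = get_labels(file, len(X))      # one label per patch
            for start in range(0, len(X), batch_size):
                yield X[start:start + batch_size], Y[start:start + batch_size]
```

With this pattern only one image's patches are in memory at a time; the steps_per_epoch passed to fit_generator should equal the total number of batches across all images.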
I have zipped radar observations totalling 110 GB. The structure of the zipped data looks like this:
year_1.tar/
date_1.gz/
time_1
time_2
...
time_365
date_2.gz/
...
date_365.gz/
year_2.tar/
...
year_n.tar/
time_n is a binary file that includes headers and data (a 2D array of size approximately 1500×2100).
Initially, I completely unzipped the data, extracted a certain region (512×512) of each array, and stored it in a .dat file as headerless binary, so the files can be read directly; the extracted data amount to 250 GB. I then read these files as memory-maps and iterate over them with a PyTorch Dataset.
Recently, however, I wanted to exploit the whole data at full size (1500×2100), so I am wondering whether it is better to completely unzip the data again, or to wrap the unzip step in the PyTorch Dataset and decompress on the fly.
My current way to store the extracted data is shown below
year_1.dat -> shape=(N, H, W)
year_2.dat -> shape=(N, H, W)
...
year_n.dat -> shape=(N, H, W)
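For the already-extracted per-year .dat files, a sketch of the memory-map approach (the class name, shapes, and dtype are assumptions; the same class can subclass torch.utils.data.Dataset unchanged, since it only needs __len__ and __getitem__):

```python
import numpy as np

class MemmapYearDataset:
    """Index frames across several per-year .dat files without loading them into RAM."""

    def __init__(self, paths, shapes, dtype=np.float32):
        # One memmap per year file; nothing is read from disk until a frame is indexed.
        self.maps = [np.memmap(p, dtype=dtype, mode='r', shape=s)
                     for p, s in zip(paths, shapes)]
        self.lengths = [s[0] for s in shapes]

    def __len__(self):
        return sum(self.lengths)

    def __getitem__(self, idx):
        # Walk the files until the global index falls inside one of them.
        for m, n in zip(self.maps, self.lengths):
            if idx < n:
                return np.asarray(m[idx])   # copies only one (H, W) frame
            idx -= n
        raise IndexError(idx)
```

Because np.memmap reads lazily, only the single (H, W) frame touched by __getitem__ is pulled from disk, so a DataLoader can batch frames without ever holding a whole year in RAM.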
I am working on a deep learning model that uses a large amount of time-series data. As the data is too big to be loaded into RAM at once, I will use Keras train_on_batch to train the model, reading data from disk. I am looking for a simple and fast way to split the data among train, validation, and test folders.
I've tried the splitfolders ratio function, but could not deactivate the data shuffling (which is inappropriate for time-series data). The arguments in the function's documentation do not include an option to turn shuffling on or off.
Code I've tried:
import splitfolders
input_folder = r"E:\Doutorado\apagar"
splitfolders.ratio(input_folder, output=r'E:\Doutorado\apagardivididos',
                   ratio=(0.7, 0.2, 0.1), group_prefix=None)
The resulting split data is shuffled, but this shuffling is a problem for my time series analysis...
source: https://pypi.org/project/split-folders/
splitfolders.ratio("input_folder", output="output",
seed=1337, ratio=(.8, .1, .1), group_prefix=None, move=False) # default values
Usage:
splitfolders [--output] [--ratio] [--fixed] [--seed] [--oversample] [--group_prefix] [--move] folder_with_images
Options:
--output path to the output folder. defaults to output. Get created if non-existent.
--ratio the ratio to split. e.g. for train/val/test .8 .1 .1 -- or for train/val .8 .2 --.
--fixed set the absolute number of items per validation/test set. The remaining items constitute
the training set. e.g. for train/val/test 100 100 or for train/val 100.
Set 3 values, e.g. 300 100 100, to limit the number of training values.
--seed set seed value for shuffling the items. defaults to 1337.
--oversample enable oversampling of imbalanced datasets, works only with --fixed.
--group_prefix split files into equally-sized groups based on their prefix
--move move the files instead of copying
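Since splitfolders always shuffles (the seed only makes the shuffle reproducible), one workaround for time series is a small chronological splitter that copies files in sorted order. This is a sketch, not part of the splitfolders API, and it assumes a single flat input folder:

```python
import os
import shutil

def ordered_split(input_folder, output_folder, ratios=(0.7, 0.2, 0.1)):
    """Split files into train/val/test by sorted filename order, without shuffling."""
    files = sorted(os.listdir(input_folder))   # chronological if names sort by time
    n = len(files)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    splits = {
        "train": files[:n_train],
        "val": files[n_train:n_train + n_val],
        "test": files[n_train + n_val:],       # remainder goes to test
    }
    for name, subset in splits.items():
        dest = os.path.join(output_folder, name)
        os.makedirs(dest, exist_ok=True)
        for f in subset:
            shutil.copy(os.path.join(input_folder, f), os.path.join(dest, f))
    return {k: len(v) for k, v in splits.items()}
```

Because the files are sorted before splitting, the three sets are contiguous in time: the earliest 70% of files go to train, the next 20% to val, and the final 10% to test.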
I have two fairly large datasets (~47 GB each) stored in netCDF files. The datasets have three dimensions: time, s, and s1. The first dataset is of shape (3000, 2088, 1000) and the second of shape (1566, 160000, 25). Both datasets are equal in size; only their shapes differ. Since my RAM is only 32 GB, I am accessing the data in blocks.
For the first dataset, when I read the first ~12 GB chunk of data, the code uses almost twice that amount of memory, whereas for the second it uses only about as much memory as the chunk itself (~12 GB). Why is this happening, and how do I stop the code from using more than necessary?
Not using extra memory is very important here, because my algorithm's efficiency hinges on every line of code using just enough memory and no more. Because of this behaviour, my system also starts swapping heavily. I am on a Linux system, if that information is useful, and I use Python 3.7.3 with netCDF4 4.6.2.
This is how I am accessing the datasets,
from netCDF4 import Dataset
dat = Dataset('dataset1.nc')
dat1 = Dataset('dataset2.nc')
chunk1 = dat.variables['data'][0:750] #~12GB worth of data uses ~24GB RAM memory
chunk2 = dat1.variables['data'][0:392] #~12GB worth of data uses ~12GB RAM memory
I am working on a deep learning project (image segmentation) and decided to move my work to Google Colab. I uploaded the notebook and the data, then used the following code to mount the drive:
from google.colab import drive
drive.mount('/content/mydrive')
The data consists of two folders, one containing the images (input data, in .jpg format) and the other containing their masks (ground truth, in .png format), each with 2600 images. I use the following code to load them:
filelist_trainx = sorted(glob.glob('drive/My Drive/Data/Trainx/*.jpg'), key=numericalSort)
X_train = np.array([np.array(Image.open(fname)) for fname in filelist_trainx])
filelist_trainy = sorted(glob.glob('drive/My Drive/Data/Trainy/*.png'), key=numericalSort)
Y_train = np.array([np.array(Image.open(fname)) for fname in filelist_trainy])
Loading X_train takes no time at all, but running the Y_train cell takes so long that I end up interrupting its execution. Does anyone know why this happens, considering that both folders contain data of the same dimensions and small size (18 MB total)? Here is a sample of the images.
Data sample
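One way to narrow this down is to time each file load individually: if per-file times are large and erratic, the bottleneck is likely the many small-file reads through the mounted Drive (in which case copying the folder to the Colab VM's local disk first usually helps) rather than PNG decoding. A hedged diagnostic sketch:

```python
import time
import numpy as np
from PIL import Image

def load_arrays(filelist, report_every=100):
    """Load images into arrays one by one, printing cumulative time periodically,
    so a stall shows up as slow progress instead of a silently hanging cell."""
    arrays, t0 = [], time.time()
    for i, fname in enumerate(filelist, 1):
        with Image.open(fname) as img:
            arrays.append(np.array(img))
        if i % report_every == 0:
            print(f"{i} files in {time.time() - t0:.1f}s")
    return np.array(arrays)
```

Calling load_arrays(filelist_trainy) in place of the list comprehension reports progress every 100 files instead of blocking silently.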
When setting up a data input pipeline to Tensorflow (web cam images), a large amount of time is spent loading the data from the system RAM to the GPU memory.
I am trying to feed a constant stream of images (1024x1024) through my object detection network. I'm currently using a V100 on AWS to perform inference.
The first attempt was with a simple feed dict operation.
# Get layers
img_input_tensor = sess.graph.get_tensor_by_name('import/input_image:0')
img_anchors_input_tensor = sess.graph.get_tensor_by_name('import/input_anchors:0')
img_meta_input_tensor = sess.graph.get_tensor_by_name('import/input_image_meta:0')
detections_input_tensor = sess.graph.get_tensor_by_name('import/output_detections:0')
detections = sess.run(detections_input_tensor,
                      feed_dict={img_input_tensor: molded_image,
                                 img_meta_input_tensor: image_meta,
                                 img_anchors_input_tensor: image_anchor})
This produced inference times around 0.06 ms per image.
However, after reading the Tensorflow manual I noticed that the tf.data API was recommended for loading data for inference.
# setup data input
data = tf.data.Dataset.from_tensors((img_input_tensor, img_meta_input_tensor, img_anchors_input_tensor, detections_input_tensor))
iterator = data.make_initializable_iterator() # create the iterator
next_batch = iterator.get_next()
# load data
sess.run(iterator.initializer,
feed_dict={img_input_tensor: molded_image, img_meta_input_tensor: image_meta, img_anchors_input_tensor: image_anchor})
# inference
detections = sess.run([next_batch])[0][3]
This sped up inference time to 0.01 ms, but the time taken to load the data rose to 0.1 ms, making the iterator method significantly slower overall than the 'slower' feed_dict method. Is there something I can do to speed up the loading process?
Here is a great guide on data pipeline optimization. I personally find the .prefetch method the easiest way to boost your input pipeline. However, the article provides much more advanced techniques.
However, if your input data is not in TFRecords but you feed it yourself, you have to implement the described techniques (buffering, interleaved operations) yourself somehow.
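As a concrete instance of that do-it-yourself buffering, a small background-thread prefetcher (a stdlib sketch, not a TensorFlow API) can overlap data preparation with inference:

```python
import threading
import queue

class Prefetcher:
    """Wrap an iterable and pull its items in a background thread
    into a bounded queue, so the consumer never waits for preparation."""
    _END = object()  # sentinel marking exhaustion of the source

    def __init__(self, iterable, buffer_size=4):
        self._queue = queue.Queue(maxsize=buffer_size)
        self._thread = threading.Thread(target=self._fill, args=(iterable,), daemon=True)
        self._thread.start()

    def _fill(self, iterable):
        for item in iterable:
            self._queue.put(item)   # blocks when the buffer is full
        self._queue.put(self._END)

    def __iter__(self):
        while True:
            item = self._queue.get()
            if item is self._END:
                return
            yield item
```

Wrapping the frame source, e.g. for inputs in Prefetcher(camera_batches): ... (camera_batches being a hypothetical iterable of prepared feed_dict inputs), lets the next inputs be prepared while the GPU runs the current sess.run.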