I want to use Dask to read a large dataset and feed it to a Keras model. The data consists of audio files, which I read with a custom function. I have applied delayed to this function and collected all of the files into a Dask array, like this:
x = da.stack([
    da.from_delayed(
        delayed(get_item_data)(fp, sr, mono, post_processing, data_shape),
        shape=data_shape,
        dtype=np.float32,
    )
    for fp in df['path']
])
(See the source)
To train the Keras model, I compute X and Y as above and pass them to the fit function.
However, training is very slow. I have tried changing the chunksize, but it is still very slow.
Could you tell me if I am doing something wrong when creating the array? Or any good practices for it?
Thanks
As far as I know, Keras doesn't have any built-in support for dask arrays, so I'm not sure what will happen when you provide a dask array directly to Keras functions. My guess is that Keras will automatically convert the dask array into a (possibly very large) NumPy array.
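If that is what happens, one workaround is to materialize only one batch at a time and feed the resulting NumPy arrays to Keras yourself. Here is a minimal sketch, assuming a compiled model and that x and y are dask arrays whose first axis indexes samples (the batch size and epoch count are arbitrary choices of mine, not anything from your code):

batch_size = 32
n_samples = x.shape[0]

for epoch in range(5):                                # illustrative epoch count
    for start in range(0, n_samples, batch_size):
        stop = start + batch_size
        xb = x[start:stop].compute()                  # materialize just this slice as NumPy
        yb = y[start:stop].compute()
        model.train_on_batch(xb, yb)                  # standard Keras per-batch API

Whether this beats converting everything up front depends mostly on how expensive get_item_data is and on how well the chunking lines up with the batch size.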
Related
I am importing the MNIST dataset as train_data_MNIST = torchvision.datasets.MNIST(root=path+"MNIST", train=True, transform=transforms, download=True) and I am trying to make a smaller dataset from MNIST, say the first 10,000 images and their corresponding labels. I know this can be handled with torch.utils.data.Subset, but what I want is a torchvision.datasets object (if I directly apply torch.utils.data.Subset to the train_data_MNIST above, the result is an object of the torch.utils.data.Subset class).
Is there any way to use a fraction of the original MNIST dataset to create a new dataset (not a subset)?
Thanks in advance.
What about modifying data and targets directly? For example:
dataset = torchvision.datasets.MNIST(root=path+"MNIST", train=True, transform=transforms, download=True)
dataset.data = dataset.data[:10000]
dataset.targets = dataset.targets[:10000]
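A quick usage sketch to check the result (my own addition; it assumes the usual torch/torchvision imports and that transforms includes ToTensor()):

print(len(dataset))        # 10000 after the slicing above
print(type(dataset))       # still torchvision.datasets.mnist.MNIST, not a Subset

loader = torch.utils.data.DataLoader(dataset, batch_size=64, shuffle=True)
images, labels = next(iter(loader))
print(images.shape, labels.shape)   # e.g. torch.Size([64, 1, 28, 28]) torch.Size([64])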
I have a bunch of tensor operations (matmul, transpose, etc.) that I would like to run on a large dataset.
Since they are still matrix operations, and since I am using Keras generators to load the data in batches, it would make sense to use GPUs to compute them.
I've searched for a while and I can't seem to find the correct way to use Keras for parallel GPU operations with generators, outside of the standard Model object interface.
Does anyone know how to do it? Thanks!
I found out that data augmentation can be done in PyTorch by using torchvision.transforms. I also read that the transformations are applied at each epoch. So I'm wondering whether the effect of copying each sample multiple times and then applying random transformations to the copies is the same as using torchvision.transforms on the original dataset (unique images) and just training for longer (more epochs).
Thanks in advance.
This is a broad question, but don't be misled into thinking that torchvision transforms don't augment your dataset: they apply random (or deterministic) transforms to your current dataset at runtime, so each sample can come out different on every access and in every epoch.
To the specific question, "is copying each sample multiple times and then applying random transformations to them the same as using torchvision.transforms on the original dataset (unique images) and just training for longer (more epochs)?" — my answer:
To increase your dataset you can copy-paste samples, or use tooling such as PyTorch or the WEKA software. However, more epochs are a totally different concept from this. Of course, the more epochs you use, the better the model will be, but only until the validation loss and training loss curves cross.
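To make the runtime behaviour concrete, here is a minimal sketch (my own illustration; the root path and the choice of RandomRotation are arbitrary) showing that fetching the same index twice can yield different tensors, because the random transform is re-drawn on every access:

import torch
import torchvision
import torchvision.transforms as T

aug = T.Compose([
    T.RandomRotation(degrees=15),   # random transform, re-applied on each access
    T.ToTensor(),
])

ds = torchvision.datasets.MNIST(root="MNIST", train=True, transform=aug, download=True)

a = ds[0][0]   # first fetch of sample 0
b = ds[0][0]   # second fetch of sample 0, with a freshly sampled rotation
print(torch.equal(a, b))   # almost always False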
Hope this helps.
I am struggling with the following points:
When should bcolz be used instead of Keras' data generator? It looks like the Keras model has APIs both to accept an array in batches and to take a data generator.
Is there a performance improvement when using bcolz with the fit() API over using a data generator with fit_generator()?
Finally, there's a fastai post that mentions dask.
Is dask better than bcolz?
Thanks!
Keras' flow_from_directory(directory) only accepts PNG, JPG, BMP or PPM images. Of course you could extend it, but bcolz is a quick fix, which is why bcolz works well for pre-computed convolutional features: save those features as bcolz arrays and load them in batches for fit_generator.
fit_generator() with a data generator (which could be a bcolz data generator) should be quicker than calling fit() on a bcolz array directly.
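Here is a minimal sketch of such a bcolz batch generator (my own illustration; the file paths, batch size, and the assumption that features and labels live in two on-disk carrays are all hypothetical):

import bcolz

def bcolz_batches(features_dir, labels_dir, batch_size=64):
    X = bcolz.open(features_dir, mode='r')   # on-disk compressed carray of features
    y = bcolz.open(labels_dir, mode='r')     # matching carray of labels
    n = X.shape[0]
    while True:                              # Keras expects the generator to loop forever
        for start in range(0, n, batch_size):
            stop = start + batch_size
            # Slicing a carray returns a plain NumPy array
            yield X[start:stop], y[start:stop]

# model.fit_generator(bcolz_batches('conv_feats.bc', 'labels.bc'),
#                     steps_per_epoch=n_batches, epochs=5)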
Is Dask better than bcolz? Dask isn't strictly an alternative to bcolz; Dask can work with bcolz arrays, and on huge datasets it can provide a speed-up because of its strong support for parallelism. bcolz is a nice compressed data container, and I'd suggest using Dask on top of bcolz if you need that speed-up.
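A small sketch of what "Dask on top of bcolz" can look like (again my own illustration; the on-disk path and chunk size are arbitrary):

import bcolz
import dask.array as da

carr = bcolz.open('conv_feats.bc', mode='r')                # compressed, on-disk carray
x = da.from_array(carr, chunks=(1024,) + carr.shape[1:])    # lazy, chunked Dask view

# Reductions and elementwise math now run chunk by chunk, in parallel
mean_per_feature = x.mean(axis=0).compute()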
I am trying to build a neural network model using TensorFlow.
But the matrices are on the order of 800000x300000. When I initialize the variables using the global variable initializer in TensorFlow, the system freezes. How do I deal with this problem?
Would TensorFlow with GPU support be able to handle such a large matrix?
You can divide the dataset into batches and then run your model batch by batch, or you can use a TensorFlow queue.
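A minimal TF1-style sketch of the batching approach (my own illustration; the feature count, batch size, placeholder-based feeding, and the load_rows helper are assumptions, not code from the question):

import tensorflow as tf   # assumes TensorFlow 1.x, to match global_variables_initializer

n_features = 300000
batch_size = 256

x = tf.placeholder(tf.float32, shape=[None, n_features])   # only one batch in memory at a time
w = tf.Variable(tf.zeros([n_features, 1]))                  # keep the variables themselves modest
logits = tf.matmul(x, w)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for start in range(0, 800000, batch_size):
        batch = load_rows(start, start + batch_size)   # hypothetical loader for one batch of rows
        sess.run(logits, feed_dict={x: batch})

The key point is that the 800000x300000 data never lives in a single TensorFlow variable; only the model parameters are variables, and the data is streamed in batches.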