attempting to manually download MNIST pytorch dataset in databricks - pytorch

I've tried a few different ways to get the MNIST dataset manually loaded into Databricks DBFS so that PyTorch can load it. However, the MNIST download seems to be just a binary file. Am I expected to unzip it first, or can I point to the gzipped archive directly? So far all my attempts have produced this error:
train_dataset = datasets.MNIST(
'dbfs:/FileStore/tarballs/train_images_idx3_ubyte.gz',
train=True,
RuntimeError: Dataset not found. You can use download=True to download it
I am aware I can set download=True; however, due to firewalls this is not an option, and I want to upload the files and wire them in myself via DBFS. Has anyone done this as well?
EDIT: #alexey suggested I need to add the extra paths MNIST/raw
And then change the input to
train_dataset = datasets.MNIST(
'/dbfs/FileStore/tarballs',
train=True,
download=False,
transform=transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,))]))
data_loader = torch.utils.data.DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
But I get the same error.

My code and dir:
train_loader = torch.utils.data.DataLoader(
    datasets.MNIST('../colabx/data', train=True, download=False,
                   transform=transforms.Compose([
                       transforms.ToTensor(),
                       transforms.Normalize((0.1307,), (0.3081,))
                   ])))
....\colabx\data\MNIST\raw>ls
t10k-images-idx3-ubyte train-images-idx3-ubyte
t10k-images-idx3-ubyte.gz train-images-idx3-ubyte.gz
t10k-labels-idx1-ubyte train-labels-idx1-ubyte
t10k-labels-idx1-ubyte.gz train-labels-idx1-ubyte.gz
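A quick way to rule out a layout problem is to check the directory before calling datasets.MNIST. The sketch below uses only the standard library and assumes that the installed torchvision looks for the four extracted files under <root>/MNIST/raw, as recent releases do (older releases checked processed/*.pt instead, so treat the file list as an assumption). The helper name missing_mnist_files is made up for illustration. Note also that Python file APIs on Databricks need the local-style /dbfs/... path rather than the dbfs:/ URI, as in the edited question above.

```python
import os
import tempfile

# File names that recent torchvision releases are assumed to look for
# under <root>/MNIST/raw (hyphens, no .gz suffix).
EXPECTED_RAW_FILES = (
    "train-images-idx3-ubyte",
    "train-labels-idx1-ubyte",
    "t10k-images-idx3-ubyte",
    "t10k-labels-idx1-ubyte",
)


def missing_mnist_files(root):
    """Return the expected raw files that are absent under <root>/MNIST/raw."""
    raw_dir = os.path.join(root, "MNIST", "raw")
    return [name for name in EXPECTED_RAW_FILES
            if not os.path.isfile(os.path.join(raw_dir, name))]


# Hypothetical layout built in a temporary directory to show the check.
with tempfile.TemporaryDirectory() as demo_root:
    demo_raw = os.path.join(demo_root, "MNIST", "raw")
    os.makedirs(demo_raw)
    # An upload named with underscores does not match the hyphenated name.
    open(os.path.join(demo_raw, "train_images_idx3_ubyte"), "wb").close()
    print(missing_mnist_files(demo_root))
```

If the list comes back empty for the root you pass to datasets.MNIST, the "Dataset not found" error is probably caused by something other than the file layout.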

Related

ValueError: too many values to unpack while using torch tensors

For a project on neural networks, I am using Pytorch and am working with the EMNIST dataset.
The code that is already given loads in the dataset:
train_dataset = dsets.MNIST(root='./data',
train=True,
transform=transforms.ToTensor(),
download=True)
And prepares it:
train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
batch_size=batch_size,
shuffle=True)
Then, when all the configurations of the network are defined, there is a for loop to train the model per epoch:
for i, (images, labels) in enumerate(train_loader):
In the example code this works fine.
For my task, I am given a dataset that I load as follows:
emnist = scipy.io.loadmat("DIRECTORY/emnist-letters.mat")
data = emnist ['dataset']
X_train = data ['train'][0, 0]['images'][0, 0]
y_train = data ['train'][0, 0]['labels'][0, 0]
Then, I create the train_dataset as follows:
train_dataset = np.concatenate((X_train, y_train), axis = 1)
train_dataset = torch.from_numpy(train_dataset)
And use the same step to prepare it:
train_loader = torch.utils.data.DataLoader(dataset=train_dataset,
batch_size=batch_size,
shuffle=True)
However, when I try to use the same loop as before:
for i, (images, labels) in enumerate(train_loader):
I get the following error:
ValueError: too many values to unpack (expected 2)
Who knows what I can do so that I can train my dataset with this loop?
The dataset you created from the EMNIST data is a single tensor, so the data loader also yields a single tensor per batch, whose first dimension is the batch dimension. Unpacking that tensor into (images, labels) iterates over the batch dimension, which fails because your batch size is greater than two. Even with a batch size of exactly two it would not do what you want.
You can use torch.utils.data.TensorDataset to easily create a dataset, which produces a tuple of images and their respective labels, just like the MNIST dataset does.
train_dataset = torch.utils.data.TensorDataset(torch.from_numpy(X_train), torch.from_numpy(y_train))
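The difference between the two structures can be illustrated without PyTorch. In the sketch below, plain Python lists stand in for tensors and the values are made up; it only shows why unpacking one combined batch fails while a per-sample (features, label) pairing unpacks cleanly.

```python
# Each row holds three made-up feature values followed by a label,
# mimicking the result of concatenating X_train and y_train.
rows = [[0.1, 0.2, 0.3, 7], [0.4, 0.5, 0.6, 2], [0.7, 0.8, 0.9, 5]]

# Case 1: one combined batch. Unpacking into two names iterates over the
# batch itself, so it fails whenever the batch has more than two rows.
batch = rows
try:
    images, labels = batch
    unpack_error = None
except ValueError as exc:
    unpack_error = str(exc)
print(unpack_error)

# Case 2: features and labels kept separate and paired per sample,
# which is the structure a (features, labels) dataset provides.
features = [row[:-1] for row in rows]
targets = [row[-1] for row in rows]
paired = list(zip(features, targets))

# A batch of pairs splits cleanly into an images part and a labels part.
images, labels = zip(*paired)
print(labels)
```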

Speed problem of model.fit() in TF2 when loading data using DataGenerator

I ran a simple classification problem with a small dataset on TF2, loading the data in two different ways.
In the first, I read the images and loaded them into (train_x, train_y) and (test_x, test_y).
Training was fast and worked fine.
Then I wanted to try using ImageDataGenerator, as follows:
training_datagen = ImageDataGenerator(
rescale = 1./255,
rotation_range=15,
fill_mode='nearest')
validation_datagen = ImageDataGenerator(rescale = 1./255)
train_generator = training_datagen.flow_from_directory(
TRAINING_DIR,
target_size=(224,224),
class_mode='categorical'
)
validation_generator = validation_datagen.flow_from_directory(
VALIDATION_DIR,
target_size=(224,224),
class_mode='categorical'
)
and then I run the training with the command
H = model.fit(
train_generator,
batch_size=2,
validation_data= validation_generator,
verbose = 1,
epochs=EPOCHS)
With this setup, training becomes extremely slow: one epoch takes several minutes, whereas in the previous case the whole training run took less than 15 seconds.
I don't understand what the problem is. Several developers seem to have run into it, but it is not clear why training becomes so slow when using a data generator.
Thanks
The issue was also addressed here
https://github.com/keras-team/keras/issues/12683#issuecomment-614963118

training large dataset using keras-flow-from-dataframe generator

What is the best way to train on a large dataset in Google Colaboratory using Keras?
Size of data: 3 GB of images stored on Google Drive.
After searching, I found that the problem is that the data doesn't fit in memory. The solution suggested in all the articles I read was to use Keras generators (as I understand it, a generator fetches one batch, trains on it, then moves on to the next batch, and so on, so there is no need to load the whole dataset into memory at once).
I tried the Keras flow_from_dataframe generator, but it didn't solve the problem and the runtime still dies (Runtime Died).
train_paths = pd.read_csv(path)
datagen = ImageDataGenerator(featurewise_center=True,
featurewise_std_normalization=True,
samplewise_std_normalization=True,
rotation_range=30,
validation_split=0.25)
train_generator=datagen.flow_from_dataframe(
dataframe=train_paths,directory= None,x_col='path',y_col='label',
subset="training",has_ext=True,
batch_size=32,target_size =(224,224),
color_mode= "rgb",seed=0,
shuffle=False,class_mode="binary",drop_duplicates=False)
def compile_and_train(model, num_epochs):
adam = Adam(lr=0.0001, beta_1=0.9, beta_2=0.999, epsilon=None, decay=10, amsgrad=False)
model.compile(optimizer= adam, loss='binary_crossentropy', metrics = ['acc'])
filepath = 'tmp/weights/' + model.name + '.{epoch:02d}-{loss:.2f}.hdf5'
checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=0, save_best_only=True, mode='auto', period=1)
STEP_SIZE_TRAIN= (train_generator.n//train_generator.batch_size)+1
STEP_SIZE_VALID= (valid_generator.n//valid_generator.batch_size)+1
Model_history=model.fit_generator(generator=train_generator,steps_per_epoch=STEP_SIZE_TRAIN,validation_data=valid_generator,validation_steps=STEP_SIZE_VALID,epochs=num_epochs, verbose=1, callbacks=[checkpoint, tensor_board],class_weight=[1])
return Model_history
MobileNet_Model= MobileNet_model(input_shape)
MobileNet_model_his= compile_and_train(MobileNet_Model, num_epochs=1)
One suggested solution is to divide the data manually (or with a for loop), save the weights after each major batch, and continue training on the next batch.
My questions: should I save the whole model (architecture) or only the weights? Is there a better solution than the for loop? And why doesn't using Keras generators solve this problem?
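A side note on the step counts in the code above: (n // batch_size) + 1 yields one step too many whenever n is an exact multiple of the batch size. This is unlikely to be the cause of the crash, but a ceiling division avoids the extra step. The sketch below uses hypothetical helper names and plain integers in place of train_generator.n and train_generator.batch_size.

```python
import math


def steps_per_epoch(num_samples, batch_size):
    """Number of batches needed to see every sample exactly once."""
    return math.ceil(num_samples / batch_size)


def floor_plus_one(num_samples, batch_size):
    """The formula used in the question, kept for comparison."""
    return num_samples // batch_size + 1


# The two formulas agree unless num_samples is a multiple of batch_size.
for n in (64, 65):
    print(n, steps_per_epoch(n, 32), floor_plus_one(n, 32))
```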

How do I load custom image based datasets into Pytorch for use with a CNN?

I have searched for hours on the internet to find a good solution to my issue. Here is some relevant background information to help you answer my question.
This is my first ever deep learning project and I have no idea what I am doing. I know the theory but not the practical elements.
The data that I am using can be found on kaggle at this link:
(https://www.kaggle.com/alxmamaev/flowers-recognition)
I am aiming to classify flowers based on the images provided in the dataset using a CNN.
Here is some sample code I have tried to use to load data in so far, this is my best attempt but as I mentioned I am clueless and Pytorch docs didn't offer much help that I could understand at my level.
(https://pastebin.com/fNLVW1UW)
# Loads the images for use with the CNN.
def load_images(image_size=32, batch_size=64, root="../images"):
transform = transforms.Compose([
transforms.Resize(32),
transforms.ToTensor(),
transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
train_set = datasets.ImageFolder(root=root, train=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=batch_size, shuffle=True, num_workers=2)
return train_loader
# Defining variables for use with the CNN.
classes = ('daisy', 'dandelion', 'rose', 'sunflower', 'tulip')
train_loader_data = load_images()
# Training samples.
n_training_samples = 3394
train_sampler = SubsetRandomSampler(np.arange(n_training_samples, dtype=np.int64))
# Validation samples.
n_val_samples = 424
val_sampler = SubsetRandomSampler(np.arange(n_training_samples, n_training_samples + n_val_samples, dtype=np.int64))
# Test samples.
n_test_samples = 424
test_sampler = SubsetRandomSampler(np.arange(n_test_samples, dtype=np.int64))
Here are the direct questions I need answers to:
How do I fix my code to load the dataset in an 80/10/10 split for training/test/validation?
How do I create the required labels/classes for these images, which are already divided into folders in /images?
Looking at the data from Kaggle and your code, there are problems in your data loading.
The data should be in a separate folder per class label for PyTorch's ImageFolder to load it correctly. In your case, since all the data is under the same folder, PyTorch loads it as a single training set. You can correct this by using a folder structure like train/daisy, train/dandelion, test/daisy, test/dandelion and then passing the train and test folders to the train and test ImageFolder respectively. Just change the folder structure and you should be good. Take a look at the official documentation of torchvision.datasets.ImageFolder, which has a similar example.
As you said, the images are already divided into folders in /images. PyTorch's ImageFolder assumes that images are organized as shown below, but this structure is only correct if you are using all the images for the training set:
```
/images/daisy/100080576_f52e8ee070_n.jpg
/images/daisy/10140303196_b88d3d6cec.jpg
.
.
.
/images/dandelion/10043234166_e6dd915111_n.jpg
/images/dandelion/10200780773_c6051a7d71_n.jpg
```
where 'daisy', 'dandelion' etc. are class labels.
The correct folder structure if you want to split the dataset into train and test sets is shown below (I know you want train, validation, and test sets, but that doesn't matter here, as this is just an example to convey the idea):
```
/images/train/daisy/100080576_f52e8ee070_n.jpg
/images/train/daisy/10140303196_b88d3d6cec.jpg
.
.
/images/train/dandelion/10043234166_e6dd915111_n.jpg
/images/train/dandelion/10200780773_c6051a7d71_n.jpg
.
.
/images/test/daisy/300080576_f52e8ee070_n.jpg
/images/test/daisy/95140303196_b88d3d6cec.jpg
.
.
/images/test/dandelion/32143234166_e6dd915111_n.jpg
/images/test/dandelion/65200780773_c6051a7d71_n.jpg
```
Then, you can refer to the following full code example on how to write a dataloader:
import os
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable
import torch.utils.data as data
import torchvision
from torchvision import transforms
EPOCHS = 2
BATCH_SIZE = 10
LEARNING_RATE = 0.003
TRAIN_DATA_PATH = "./images/train/"
TEST_DATA_PATH = "./images/test/"
TRANSFORM_IMG = transforms.Compose([
transforms.Resize(256),
transforms.CenterCrop(256),
transforms.ToTensor(),
transforms.Normalize(mean=[0.485, 0.456, 0.406],
std=[0.229, 0.224, 0.225] )
])
train_data = torchvision.datasets.ImageFolder(root=TRAIN_DATA_PATH, transform=TRANSFORM_IMG)
train_data_loader = data.DataLoader(train_data, batch_size=BATCH_SIZE, shuffle=True, num_workers=4)
test_data = torchvision.datasets.ImageFolder(root=TEST_DATA_PATH, transform=TRANSFORM_IMG)
test_data_loader = data.DataLoader(test_data, batch_size=BATCH_SIZE, shuffle=True, num_workers=4)
class CNN(nn.Module):
    # omitted...

if __name__ == '__main__':
    print("Number of train samples: ", len(train_data))
    print("Number of test samples: ", len(test_data))
    print("Detected Classes are: ", train_data.class_to_idx)  # classes are detected by folder structure
    model = CNN()
    optimizer = torch.optim.Adam(model.parameters(), lr=LEARNING_RATE)
    loss_func = nn.CrossEntropyLoss()
    # Training and Testing
    for epoch in range(EPOCHS):
        for step, (x, y) in enumerate(train_data_loader):
            # The model is assumed to return (logits, last_layer), as in the test call below
            output = model(x)[0]
            loss = loss_func(output, y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            if step % 50 == 0:
                # Evaluate on the test set one batch at a time, without tracking gradients
                model.eval()
                correct, total = 0, 0
                with torch.no_grad():
                    for test_x, test_y in test_data_loader:
                        test_output, last_layer = model(test_x)
                        pred_y = torch.max(test_output, 1)[1]
                        correct += (pred_y == test_y).sum().item()
                        total += test_y.size(0)
                model.train()
                accuracy = correct / total
                print('Epoch: ', epoch, '| train loss: %.4f' % loss.item(), '| test accuracy: %.2f' % accuracy)
There is now a small package for the splitting, called 'split-folders'.
For example:
import splitfolders
splitfolders.ratio(image_path, output="output", seed=43, ratio=(.8,.1,.1))
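If installing an extra package is not an option, the same kind of split can be sketched with the standard library alone. The function below is a hypothetical helper, not part of split-folders or torchvision. It assumes the source folder has one sub-folder per class, copies files rather than moving them, and sends the rounding remainder to the test split.

```python
import random
import shutil
from pathlib import Path


def split_image_folders(src, dst, ratios=(0.8, 0.1, 0.1), seed=43):
    """Copy src/<class>/* into dst/{train,val,test}/<class>/ by the given ratios."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    src, dst = Path(src), Path(dst)
    for class_dir in sorted(p for p in src.iterdir() if p.is_dir()):
        files = sorted(f for f in class_dir.iterdir() if f.is_file())
        rng.shuffle(files)
        n_train = int(len(files) * ratios[0])
        n_val = int(len(files) * ratios[1])
        parts = {
            "train": files[:n_train],
            "val": files[n_train:n_train + n_val],
            "test": files[n_train + n_val:],  # whatever is left over
        }
        for split_name, split_files in parts.items():
            out_dir = dst / split_name / class_dir.name
            out_dir.mkdir(parents=True, exist_ok=True)
            for f in split_files:
                shutil.copy2(f, out_dir / f.name)
```

Calling split_image_folders("images", "output") would produce output/train, output/val and output/test, each with one sub-folder per class, which can then be passed to separate ImageFolder instances.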

How do I use the Tensorboard callback of Keras?

I have built a neural network with Keras. I would like to visualize its data with TensorBoard, so I used:
keras.callbacks.TensorBoard(log_dir='/Graph', histogram_freq=0,
write_graph=True, write_images=True)
as explained in keras.io. When I run the callback I get <keras.callbacks.TensorBoard at 0x7f9abb3898>, but I don't get any file in my folder "Graph". Is there something wrong in how I have used this callback?
keras.callbacks.TensorBoard(log_dir='./Graph', histogram_freq=0,
write_graph=True, write_images=True)
This line creates a TensorBoard callback object; you should capture that object and pass it to the fit function of your model.
tbCallBack = keras.callbacks.TensorBoard(log_dir='./Graph', histogram_freq=0, write_graph=True, write_images=True)
...
model.fit(...inputs and parameters..., callbacks=[tbCallBack])
This way you gave your callback object to the function. It will be run during the training and will output files that can be used with tensorboard.
If you want to visualize the files created during training, run in your terminal
tensorboard --logdir path_to_current_dir/Graph
This is how you use the TensorBoard callback:
from keras.callbacks import TensorBoard
tensorboard = TensorBoard(log_dir='./logs', histogram_freq=0,
write_graph=True, write_images=False)
# define model
model.fit(X_train, Y_train,
batch_size=batch_size,
epochs=nb_epoch,
validation_data=(X_test, Y_test),
shuffle=True,
callbacks=[tensorboard])
Change
keras.callbacks.TensorBoard(log_dir='/Graph', histogram_freq=0,
write_graph=True, write_images=True)
to
tbCallBack = keras.callbacks.TensorBoard(log_dir='Graph', histogram_freq=0,
write_graph=True, write_images=True)
and set your model
tbCallBack.set_model(model)
Run in your terminal
tensorboard --logdir Graph/
If you are working with the Keras library and want to use TensorBoard to plot accuracy and other variables, follow the steps below.
Step 1: Import the TensorBoard callback from Keras:
from keras.callbacks import TensorBoard
Step 2: Add the following line to your program just before the model.fit() call:
tensor_board = TensorBoard(log_dir='./Graph', histogram_freq=0, write_graph=True, write_images=True)
Note: Use "./Graph", which creates the Graph folder in your current working directory. Avoid "/Graph".
Step 3: Pass the TensorBoard callback to model.fit(), for example:
model.fit(X_train,y_train, batch_size=batch_size, epochs=nb_epoch, verbose=1, validation_split=0.2,callbacks=[tensor_board])
Step 4: Run your code and check that the Graph folder has been created in your working directory. If the code above works correctly, you will have a "Graph" folder there.
Step 5: Open a terminal in your working directory and run:
tensorboard --logdir ./Graph
Step 6: Open your web browser and go to:
http://localhost:6006
The TensorBoard page will open, where you can see graphs of the different variables.
Here is some code:
K.set_learning_phase(1)
K.set_image_data_format('channels_last')
tb_callback = keras.callbacks.TensorBoard(
log_dir=log_path,
histogram_freq=2,
write_graph=True
)
tb_callback.set_model(model)
callbacks = []
callbacks.append(tb_callback)
# Train net:
history = model.fit(
[x_train],
[y_train, y_train_c],
batch_size=int(hype_space['batch_size']),
epochs=EPOCHS,
shuffle=True,
verbose=1,
callbacks=callbacks,
validation_data=([x_test], [y_test, y_test_coarse])
).history
# Test net:
K.set_learning_phase(0)
score = model.evaluate([x_test], [y_test, y_test_coarse], verbose=0)
Basically, histogram_freq=2 is the most important parameter to tune when calling this callback: it sets an interval of epochs to call the callback, with the goal of generating fewer files on disks.
So here is an example visualization of the evolution of values for the last convolution throughout training once seen in TensorBoard, under the "histograms" tab (and I found the "distributions" tab to contain very similar charts, but flipped on the side):
In case you would like to see a full example in context, you can refer to this open-source project: https://github.com/Vooban/Hyperopt-Keras-CNN-CIFAR-100
If you are using Google Colab, a simple way to visualize the graph is:
import tensorboardcolab as tb
tbc = tb.TensorBoardColab()
tensorboard = tb.TensorBoardColabCallback(tbc)
history = model.fit(x_train,# Features
y_train, # Target vector
batch_size=batch_size, # Number of observations per batch
epochs=epochs, # Number of epochs
callbacks=[early_stopping, tensorboard], # Early stopping
verbose=1, # Print description after each epoch
validation_split=0.2, #used for validation set every each epoch
validation_data=(x_test, y_test)) # Test data-set to evaluate the model in the end of training
Create the Tensorboard callback:
from keras.callbacks import TensorBoard
from datetime import datetime
logDir = "./Graph/" + datetime.now().strftime("%Y%m%d-%H%M%S") + "/"
tb = TensorBoard(log_dir=logDir, histogram_freq=2, write_graph=True, write_images=True, write_grads=True)
Pass the Tensorboard callback to the fit call:
history = model.fit(X_train, y_train, epochs=200, callbacks=[tb])
When running the model, if you get the Keras error
"You must feed a value for placeholder tensor"
try resetting the Keras session before creating the model:
import keras.backend as K
K.clear_session()
You wrote log_dir='/Graph'. Did you mean ./Graph instead? At the moment the logs are sent to /Graph at the root of the filesystem, not to a folder in your working directory.
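The difference between the two spellings can be checked with a few lines of standard-library Python. This sketch assumes POSIX-style paths and a made-up working directory; resolve_log_dir is a hypothetical helper, not a Keras function.

```python
from pathlib import PurePosixPath


def resolve_log_dir(log_dir, cwd):
    """Show where a log_dir string points for a given working directory."""
    # Joining onto cwd leaves absolute paths (leading "/") unchanged.
    return str(PurePosixPath(cwd) / PurePosixPath(log_dir))


cwd = "/home/user/project"  # hypothetical working directory
print(resolve_log_dir("/Graph", cwd))   # absolute: ignores cwd
print(resolve_log_dir("./Graph", cwd))  # relative: lands inside cwd
print(resolve_log_dir("Graph", cwd))    # same location as "./Graph"
```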
You should check out Losswise (https://losswise.com), it has a plugin for Keras that's easier to use than Tensorboard and has some nice extra features. With Losswise you'd just use from losswise.libs import LosswiseKerasCallback and then callback = LosswiseKerasCallback(tag='my fancy convnet 1') and you're good to go (see https://docs.losswise.com/#keras-plugin).
There are a few things.
First, not /Graph but ./Graph
Second, when you use the TensorBoard callback, always pass validation data, because without it, it wouldn't start.
Third, if you want to use anything except scalar summaries, then you should only use the fit method because fit_generator will not work. Or you can rewrite the callback to work with fit_generator.
To add callbacks, just add it to model.fit(..., callbacks=your_list_of_callbacks)

Resources