How can I cross-validate with PyTorch and Optuna?

I want to add cross-validation to the official PyTorch-based Optuna sample code (https://github.com/optuna/optuna/blob/master/examples/pytorch_simple.py).
I thought about splitting the data into folds and running parameter tuning separately for each fold, but then the average accuracy for a given set of parameters cannot be obtained, because the parameters that show up in study.trials_dataframe() are different for each fold.

I think you need to evaluate all folds and calculate the mean inside a single objective function. I created an example notebook, so please take a look.
In the notebook, I slightly modified the objective function so that the data loaders are passed in as arguments, and added a wrapper function objective_cv that calls the objective function once per fold of the split dataset. Then I optimized objective_cv instead of the objective function.
def objective(trial, train_loader, valid_loader):
    # Remove the following line.
    # train_loader, valid_loader = get_mnist()
    ...
    return accuracy


def objective_cv(trial):
    # Get the MNIST dataset.
    dataset = datasets.MNIST(DIR, train=True, download=True, transform=transforms.ToTensor())

    fold = KFold(n_splits=3, shuffle=True, random_state=0)
    scores = []
    for fold_idx, (train_idx, valid_idx) in enumerate(fold.split(range(len(dataset)))):
        train_data = torch.utils.data.Subset(dataset, train_idx)
        valid_data = torch.utils.data.Subset(dataset, valid_idx)

        train_loader = torch.utils.data.DataLoader(
            train_data,
            batch_size=BATCHSIZE,
            shuffle=True,
        )
        valid_loader = torch.utils.data.DataLoader(
            valid_data,
            batch_size=BATCHSIZE,
            shuffle=True,
        )

        accuracy = objective(trial, train_loader, valid_loader)
        scores.append(accuracy)
    return np.mean(scores)


study = optuna.create_study(direction="maximize")
study.optimize(objective_cv, n_trials=20, timeout=600)
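With this wrapper, the value recorded for each trial is already the mean accuracy over the three folds, so the usual Optuna inspection helpers apply afterwards. A short usage note (not part of the original example):

# Each row of the dataframe is one trial; its value is the mean CV accuracy
# returned by objective_cv for that trial's parameters.
print(study.best_trial.params)
print(study.best_value)
df = study.trials_dataframe()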

Related

PyTorch batch-wise augmentation

I'm trying to fix the class imbalance of my dataset for a classification task by adding augmented images. When the network didn't improve, I noticed that I'm transforming the whole dataset without keeping the original images.
What is the best method to fix that?
My training function looks like this (excerpt):
def train(mu, lr, batch_size, n_epochs, k, model, use_gpu, size_image, seed, num_workers, root):
    set_seed(seed, use_gpu)
    train_loader, test_loader, dataset_attributes = get_data(size_image, root, batch_size, num_workers)
    criteria = CrossEntropyLoss()
    optimizer = SGD(model.parameters(), lr=lr, momentum=0.9, weight_decay=mu, nesterov=True)
    best_acc = 0.0

    for epoch in tqdm(range(n_epochs), desc='epoch', position=0):
        t = time.time()
        optimizer = update_optimizer(optimizer, lr_schedule=dataset_attributes['lr_schedule'], epoch=epoch)

        loss_epoch_train, f1_epoch_train, acc_epoch_train, topk_acc_epoch_train = train_epoch(
            model, optimizer, train_loader,
            criteria, loss_train, f1_train, acc_train,
            topk_acc_train, k,
            dataset_attributes['n_train'],
            use_gpu)

        if acc_epoch_test > best_acc:
            best_acc = acc_epoch_test
            save(model, optimizer, epoch, os.path.join(save_dir, 'weights_best_acc.tar'))
This is an excerpt of my get_data function:
def get_data(size_image, root, batch_size, num_workers):
    transform = transforms.Compose(
        [MaxCenterCrop(),
         transforms.Resize(size_image),
         transforms.ToTensor()])

    trainset = Plantnet(root, 'images_train', transform=transform)
    testset = Plantnet(root, 'images_test', transform=transform)

    train_class_to_num_instances = Counter(trainset.targets)
    test_class_to_num_instances = Counter(testset.targets)
    ...

    sampler = WeightedRandomSampler(torch.DoubleTensor(weights), int(num_samples))

    trainloader = torch.utils.data.DataLoader(trainset, batch_size=batch_size,
                                              sampler=sampler,
                                              shuffle=False, num_workers=num_workers)
    testloader = torch.utils.data.DataLoader(testset, batch_size=batch_size,
                                             shuffle=False, num_workers=num_workers)

    return trainloader, testloader, dataset_attributes
Now my idea for an easy fix would be to add a transformed dataset and concatenate it to the original one. But I think this idea would have a bad impact on performance and wouldn't really fix the problem of class imbalance.
I'm thinking that applying the transformation to each batch would make the most sense. But how do I add this to my code?
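No answer is attached here, but as a sketch of the per-batch idea: recent torchvision transforms accept batches of tensor images with shape (B, C, H, W), so the augmentation can be applied inside the training loop instead of in the Dataset. The loop below is hypothetical and only reuses the names model, train_loader, criteria and optimizer from the excerpt above; the transform choice is just an example.

from torchvision import transforms

batch_transform = transforms.RandomHorizontalFlip(p=0.5)

for images, targets in train_loader:
    # Augment the batch on the fly; the originals in the dataset stay untouched.
    # Note that a single random draw is applied to the whole batch here; per-sample
    # randomness would need a per-sample transform inside the Dataset instead.
    images = batch_transform(images)

    optimizer.zero_grad()
    loss = criteria(model(images), targets)
    loss.backward()
    optimizer.step()

Keep in mind that per-batch augmentation alone does not rebalance the classes; the WeightedRandomSampler in get_data is still what controls how often each class is drawn.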

How do you test a custom dataset in PyTorch?

I've been following PyTorch tutorials that use the built-in datasets, which let you choose whether to load the training split or the test split. But now I'm using a .csv file and a custom dataset.
class MyDataset(Dataset):
    def __init__(self, root, n_inp):
        self.df = pd.read_csv(root)
        self.data = self.df.to_numpy()
        self.x, self.y = (torch.from_numpy(self.data[:, :n_inp]),
                          torch.from_numpy(self.data[:, n_inp:]))

    def __getitem__(self, idx):
        return self.x[idx, :], self.y[idx, :]

    def __len__(self):
        return len(self.data)
How can I tell PyTorch not to train on my test_dataset, so I can use it as a reference for how accurate my model is?
train_dataset = MyDataset("heart.csv", input_size)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_dataset = MyDataset("heart.csv", input_size)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=True)
In PyTorch, a custom dataset inherits from the class Dataset. It mainly contains two methods: __len__(), which specifies the length of your dataset to iterate over, and __getitem__(), which returns one sample at a time (the DataLoader then collects samples into batches).
Once the dataloader objects are initialized (train_loader and test_loader as specified in your code), you need to write a train loop and a test loop.
def train(model, optimizer, loss_fn, dataloader):
    model.train()
    for i, (input, gt) in enumerate(dataloader):
        if params.use_gpu:  # (if training using GPU)
            input, gt = input.cuda(non_blocking=True), gt.cuda(non_blocking=True)
        predicted = model(input)
        loss = loss_fn(predicted, gt)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
and your test loop should be:
def test(model, loss_fn, dataloader):
    model.eval()
    for i, (input, gt) in enumerate(dataloader):
        if params.use_gpu:  # (if testing using GPU)
            input, gt = input.cuda(non_blocking=True), gt.cuda(non_blocking=True)
        predicted = model(input)
        loss = loss_fn(predicted, gt)
In addition, you can use a metrics dictionary to log your predictions, loss, epochs, etc. The main difference between the training and test loops is that we exclude backpropagation (zero_grad(), backward(), step()) at inference time.
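As a small sketch of that logging idea, here is a variant of the test loop above that fills such a dictionary (the dictionary layout is an assumption, not part of the original code, and the accuracy line assumes integer class labels):

def test_with_metrics(model, loss_fn, dataloader):
    model.eval()
    metrics = {"loss": 0.0, "correct": 0, "total": 0}
    with torch.no_grad():
        for input, gt in dataloader:
            predicted = model(input)
            metrics["loss"] += loss_fn(predicted, gt).item() * input.size(0)
            metrics["correct"] += (predicted.argmax(dim=1) == gt).sum().item()
            metrics["total"] += input.size(0)
    metrics["loss"] /= metrics["total"]
    metrics["accuracy"] = metrics["correct"] / metrics["total"]
    return metrics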
Finally,
for epoch in range(1, epochs + 1):
    train(model, optimizer, loss_fn, train_loader)
    test(model, loss_fn, test_loader)
There are a couple of things to note when you're testing in PyTorch:
Put your model into evaluation mode so that things like dropout and batch normalization aren't in training mode: model.eval()
Put a wrapper around your testing code to avoid the computation of gradients (saving memory and time): with torch.no_grad():
Normalise or standardise your data according to your training set only. This is important for min/max normalisation or z-score standardisation so that the model accurately reflects test performance.
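A minimal illustration of the last point (the tensors and split sizes here are made up for the example):

import torch

# Hypothetical feature tensors for the training and test splits.
train_x = torch.randn(500, 13)
test_x = torch.randn(100, 13)

# Compute statistics on the training set only...
mean = train_x.mean(dim=0)
std = train_x.std(dim=0)

# ...and apply those same statistics to both splits.
train_x = (train_x - mean) / std
test_x = (test_x - mean) / std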
Other than that, what you've written looks pretty fine to me, as you're not applying any transforms to your data (for example, image flipping or Gaussian noise injection). To show what the code should look like in test mode, see below:
for e in range(num_epochs):
    model.train()
    for B, (dat, label) in enumerate(train_loader):
        # transforms here
        opt.zero_grad()
        out = model(dat.to(device))
        loss = criterion(out, label.to(device))
        loss.backward()
        opt.step()

    with torch.no_grad():
        model.eval()
        global_corr = 0
        for B, (dat, label) in enumerate(test_loader):
            out = model(dat.to(device))
            # get batch eval metrics here!
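One more note on the question itself: both loaders read the same heart.csv, so the model would be evaluated on rows it was trained on. A common way to carve a held-out test set out of a single custom dataset is torch.utils.data.random_split; a sketch reusing the MyDataset class from the question:

import torch
from torch.utils.data import DataLoader, random_split

full_dataset = MyDataset("heart.csv", input_size)

# Hold out 20% of the rows for testing; the generator makes the split reproducible.
test_size = int(0.2 * len(full_dataset))
train_size = len(full_dataset) - test_size
train_dataset, test_dataset = random_split(
    full_dataset, [train_size, test_size],
    generator=torch.Generator().manual_seed(42))

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)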

Keras fit_generator() with y=None

I am playing with Variational Autoencoders and would like to adapt a Keras example found on GitHub.
Basically, the example is very simple and based on the MNIST dataset, and I would like to apply it to a more difficult, more realistic dataset.
Code I'm trying to modify:
vae_dfc.fit(
    x_train,
    epochs=epochs,
    steps_per_epoch=train_size // batch_size,
    validation_data=(x_val),
    validation_steps=val_size // batch_size,
    verbose=1
)
With more complex datasets it is nearly impossible to load everything into memory, so we need to use fit_generator() to train the model. But it doesn't seem to be able to handle this:
image_generator = image.ImageDataGenerator(
    rescale=1./255,
    validation_split=0.2
)

train_generator = image_generator.flow_from_directory(
    dir,
    class_mode=None,
    color_mode='rgb',
    target_size=(ORIGINAL_SHAPE[0], ORIGINAL_SHAPE[1]),
    batch_size=BATCH_SIZE,
    subset='training'
)

vae.fit_generator(
    train_generator,
    epochs=EPOCHS,
    steps_per_epoch=train_generator.samples // BATCH_SIZE,
    validation_data=validation_generator,
    validation_steps=validation_generator.samples // BATCH_SIZE
)
My understanding is that class_mode=None is producing an output similar to the original simple example, but the fit_generator() is unable to handle this. Are there any workarounds to deal with the fit generator error?
Configurations:
tensorflow-gpu==1.12.0
Python 3.6
Windows 10
Cuda 9.0
Full error:
File "xxx\venv\lib\site-packages\tensorflow\python\keras\engine\training.py",
line 2177, in fit_generator
initial_epoch=initial_epoch)
File "xxx\venv\lib\site-packages\tensorflow\python\keras\engine\training_generator.py",
line 162, in fit_generator
'or (x, y). Found: ' + str(generator_output)) ValueError: Output of generator should be a tuple (x, y, sample_weight) or (x, y).
Found: [[[[0.48627454 0.34901962 0.2901961 ] ....]]]
An autoencoder needs outputs = inputs. It's different from not having outputs.
I believe you can try class_mode='input'.
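A minimal sketch of that option, reusing the flow_from_directory call from the question; only class_mode changes, and with class_mode='input' the generator yields each image batch as both input and target, which is what an autoencoder expects:

train_generator = image_generator.flow_from_directory(
    dir,
    class_mode='input',  # yields (x, x) batches instead of x alone
    color_mode='rgb',
    target_size=(ORIGINAL_SHAPE[0], ORIGINAL_SHAPE[1]),
    batch_size=BATCH_SIZE,
    subset='training'
)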
If this doesn't work, you can create a wrapper generator that outputs both x and y (with y = x):
class AutoencGenerator(keras.utils.Sequence):
    def __init__(self, originalGenerator):
        self.generator = originalGenerator

    def __len__(self):
        return len(self.generator)

    def __getitem__(self, i):
        x = self.generator[i]
        return x, x

    def on_epoch_end(self):
        self.generator.on_epoch_end()  # only if the original generator has an on_epoch_end

train_autoenc_generator = AutoencGenerator(train_generator)
Both options require that your model has outputs, of course. If the model was created without outputs (unusual), make it output the reconstructions and pass the loss function in model.compile(loss=the_loss).
Example of VAE
inputs = Input(shape)
means, sigmas = encoder(inputs)

def encode(x):
    means, sigmas = x
    randomSamples = tf.random_normal(K.shape(means))  # samples
    encoded = (sigmas * randomSamples) + means
    return encoded

encodings = Lambda(encode)([means, sigmas])
outputs = decoder(encodings)

kl_loss = some_tensor_function(means, sigmas)

VAE = Model(inputs, outputs)
VAE.add_loss(kl_loss)
VAE.compile(loss='mse', optimizer='adam')
Train with the generator:
VAE.fit_generator(train_autoenc_generator, ...)

Sklearn models return weights of zero on small dataset

I'm currently trying to use sklearn to correlate population stagnation and happiness internationally. I've prepared and cleaned the datasets with pandas, but for some reason no model I try will train properly. One of my data columns was the country name, so I used the pandas get_dummies function, since the models can't take strings directly. The shapes of my training and testing variables are: (617, 67), (617,), (151, 67), (151,).
rf_class = RandomForestClassifier(n_estimators=5)
log_class = LogisticRegression()
svm_class = SVC(kernel='rbf', C=1E11, verbose=False)

def run(model, model_name='this model', trainX=trainX, trainY=trainY, testX=testX, testY=testY):
    # print(cross_val_score(model, trainX, trainY, scoring='accuracy', cv=10))
    accuracy = cross_val_score(model, trainX, trainY,
                               scoring='accuracy', cv=2).mean() * 100
    model.fit(trainX, trainY)
    testAccuracy = model.score(testX, testY)
    print("Training accuracy of " + model_name + " is: ", accuracy)
    print("Testing accuracy of " + model_name + " is: ", testAccuracy * 100)
    print('\n')

# run(rf_class, 'log')

model = log_class
model.fit(trainX, trainY)

perm = PermutationImportance(model, random_state=1).fit(testX, testY)
eli5.show_weights(perm, feature_names=feature_names)
Is my dataset simply too small to train on? Are the dummies too much for the models? Any help that can be offered is greatly appreciated.
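No answer is attached here, but one generic thing worth checking in a setup like this is feature scaling: LogisticRegression and SVC are sensitive to unscaled features (and C=1E11 leaves the SVC essentially unregularized), so wrapping the estimator in a Pipeline with a StandardScaler is a common sanity check. A sketch only, assuming trainX and trainY are as in the question:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# The scaler is fit inside each CV fold, so the held-out fold does not leak
# into the scaling statistics.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipeline, trainX, trainY, scoring='accuracy', cv=5)
print(scores.mean())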

Batching huge data in TensorFlow

I am trying to perform binary classification using the code/tutorial from
https://github.com/eisenjulian/nlp_estimator_tutorial/blob/master/nlp_estimators.py
print("Loading data...")
(x_train_variable, y_train), (x_test_variable, y_test) = imdb.load_data(num_words=vocab_size)
print(len(y_train), "train sequences")
print(len(y_test), "test sequences")
print("Pad sequences (samples x time)")
x_train = sequence.pad_sequences(x_train_variable,
maxlen=sentence_size,
padding='post',
value=0)
x_test = sequence.pad_sequences(x_test_variable,
maxlen=sentence_size,
padding='post',
value=0)
print("x_train shape:", x_train.shape)
print("x_test shape:", x_test.shape)
def train_input_fn():
    dataset = tf.data.Dataset.from_tensor_slices((x_train, x_len_train, y_train))
    dataset = dataset.shuffle(buffer_size=len(x_train_variable))
    dataset = dataset.batch(100)
    dataset = dataset.map(parser)
    dataset = dataset.repeat()
    iterator = dataset.make_one_shot_iterator()
    return iterator.get_next()

def eval_input_fn():
    dataset = tf.data.Dataset.from_tensor_slices((x_test, x_len_test, y_test))
    dataset = dataset.batch(100)
    dataset = dataset.map(parser)
    iterator = dataset.make_one_shot_iterator()
    return iterator.get_next()

def cnn_model_fn(features, labels, mode, params):
    input_layer = tf.contrib.layers.embed_sequence(
        features['x'], vocab_size, embedding_size,
        initializer=params['embedding_initializer'])

    training = mode == tf.estimator.ModeKeys.TRAIN
    dropout_emb = tf.layers.dropout(inputs=input_layer,
                                    rate=0.2,
                                    training=training)

    conv = tf.layers.conv1d(
        inputs=dropout_emb,
        filters=32,
        kernel_size=3,
        padding="same",
        activation=tf.nn.relu)

    # Global Max Pooling
    pool = tf.reduce_max(input_tensor=conv, axis=1)

    hidden = tf.layers.dense(inputs=pool, units=250, activation=tf.nn.relu)
    dropout_hidden = tf.layers.dropout(inputs=hidden,
                                       rate=0.2,
                                       training=training)
    logits = tf.layers.dense(inputs=dropout_hidden, units=1)

    # This will be None when predicting
    if labels is not None:
        labels = tf.reshape(labels, [-1, 1])

    optimizer = tf.train.AdamOptimizer()

    def _train_op_fn(loss):
        return optimizer.minimize(
            loss=loss,
            global_step=tf.train.get_global_step())

    return head.create_estimator_spec(
        features=features,
        labels=labels,
        mode=mode,
        logits=logits,
        train_op_fn=_train_op_fn)

cnn_classifier = tf.estimator.Estimator(model_fn=cnn_model_fn,
                                        model_dir=os.path.join(model_dir, 'cnn'),
                                        params=params)

train_and_evaluate(cnn_classifier)
The example here loads data from IMDB movie reviews. I have my own text dataset, which is approximately 2 GB in size. Now, in this example the line
(x_train_variable, y_train), (x_test_variable, y_test) = imdb.load_data(num_words=vocab_size) tries to load the whole dataset into memory. If I try to do the same, I run out of memory. How can I restructure this logic to read data in batches from disk?
You want to change the dataset = tf.data.Dataset.from_tensor_slices((x_train, x_len_train, y_train)) line. There are lots of ways of creating a dataset - from_tensor_slices is the easiest, but won't work on its own if you can't load the entire dataset into memory.
The best way depends on how you have the data stored, or how you want to store and manipulate it. The simplest option in my opinion, with very little downside (unless you're running on multiple GPUs), is to have the original dataset just yield indices into the data, and to write a normal numpy function that loads the ith example.
dataset = tf.data.Dataset.from_tensor_slices(tf.range(epoch_size))

def tf_map_fn(i):
    def np_map_fn(i):
        return load_ith_example(i)

    inp1, inp2 = tf.py_func(np_map_fn, (i,), Tout=(tf.float32, tf.float32), stateful=False)
    # other preprocessing/data augmentation goes here.

    # unbatched sizes
    inp1.set_shape(shape1)
    inp2.set_shape(shape2)
    return inp1, inp2

dataset = dataset.repeat().shuffle(epoch_size).map(tf_map_fn, 8)
dataset = dataset.batch(batch_size)
dataset = dataset.prefetch(1)  # start loading data as GPU trains on previous batch

inp1, inp2 = dataset.make_one_shot_iterator().get_next()
Here I assume your outputs are float32 tensors (Tout=...). The set_shape calls aren't strictly necessary, but if you know the shapes, TensorFlow can do better error checking.
So long as your preprocessing doesn't take longer than your network to run, this should run just as fast as any other method on a single GPU machine.
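For completeness, a minimal sketch of what the load_ith_example placeholder above could look like, assuming each example has been pre-split into a pair of .npy files on disk (the file layout and names here are hypothetical):

import numpy as np

def load_ith_example(i):
    # One small .npy file per example, written ahead of time,
    # e.g. data/features_00042.npy and data/labels_00042.npy.
    features = np.load("data/features_%05d.npy" % i).astype(np.float32)
    labels = np.load("data/labels_%05d.npy" % i).astype(np.float32)
    return features, labels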
The other obvious way is to convert your data to tfrecords, but that'll take up more space on disk and is more of a pain to manage if you ask me.
