How to oversample image dataset using Python? - python-3.x

I am working on a multiclass classification problem with an unbalanced dataset of images(different class). I tried imblearn library, but it is not working on the image dataset.
I have a dataset of images belonging to 3 class namely A,B,C. A has 1000 data, B has 300 and C has 100. I want to oversample class B and C, so that I can avoid data imbalance. Please let me know how to oversample the image dataset using python.

Actually, it seems imblearn.over_sampling resampling just 2d dims inputs. So one way to oversampling your image dataset by this library is to use reshaping alongside with it, you can:
reshape your images
oversample them
again reshape the new dataset to
the first dims
consider you have an image dataset of size (5000, 28, 28, 3) and dtype of nd.array, following the above instructions, you can use the solution below:
# X : current_dataset
# y : labels
from imblearn.over_sampling import RandomOverSampler
reshaped_X = X.reshape(X.shape[0],-1)
#oversampling
oversample = RandomOverSampler()
oversampled_X, oversampled_y = oversample.fit_resample(reshaped_X , y)
# reshaping X back to the first dims
new_X = oversampled_X.reshape(-1,28,28,3)
hope that was helpful!

Related

Keras data augmentaion changes pixel values for masks (segmentation)

Iam using runtime data augmentation using generators in keras for segmentation problem..
Here is my data generator
data_gen_args = dict(
width_shift_range=0.1,
height_shift_range=0.1,
zoom_range=0.2,
horizontal_flip=True,
validation_split=0.2
)
image_datagen = ImageDataGenerator(**data_gen_args)
def generate_data_generator(generator, Xi, Yi):
genXi = generator.flow(Xi, seed=7, batch_size=32)
genYi = generator.flow(Yi, seed=7,batch_size=32)
while True:
Xi = genXi.next()
Yi = genYi.next()
print(Yi.dtype)
print(np.unique(Yi))
yield (Xi, Yi)
train_generator = generate_data_generator(image_datagen,
x_train,
y_train)
My labels are in a numpy array with data type float 32 and value 0.0 and 1.0.
#Output of np.unique(y_train)
array([0., 1.], dtype=float32
However, the data generator seems to modifies pixel values as shown below:-
#Output of print(np.unique(Yi))
[0.00000000e+00 1.01742386e-04 1.74021334e-04 ... 9.99918878e-01
9.99988437e-01 1.00000000e+00]
It is supposed to have same values(0.0 and 1.0) after data geneartion..
Also, the the official documentation shows an example using same augmentation arguments for generating mask and images together.
However when i remove shift and zoom iam getting (0.0 and 1.0) as output.
Keras verion 2.2.4,Python 3.6.8
UPDATE:-
I saved those images as numpy array and plotted it using matplotlib.It looks like the edges are smoothly interpolated (0.0-1.0) somehow upon including shifts and zoom augmentation. I can round these values in my custom generator as a hack; but i still don't understand the root cause (in case of normal images this is quite unnoticeable and has no adverse effects; but in masks we don't want to change label values )!!!
Still wondering.. is this a bug (nobody has mentioned it so far)or problem with my custom code ??

Add channel to MNIST via transform?

I'm trying to use the MNIST dataset from torchvision.datasets.It seems to be provided as an N x H x W (uint8) (batch dimension, height, width) tensor. All the pytorch classes for working on images (for instance Conv2d) however require a N x C x H x W (float32) tensor where C is the number of colour channels. I've tried to add add the ToTensor transform but that didn't add a color channel.
Is there a way using torchvision.transforms to add this additional dimension? For a raw tensor we could just do .unsqueeze(1) but that doesn't look like a very elegant solution. I'm just trying to do it the "proper" way.
Here is the failed conversion.
import torchvision
dataset = torchvision.datasets.MNIST("~/PyTorchDatasets/MNIST/", train=True, transform=torchvision.transforms.ToTensor(), download=True)
print(dataset.train_data[0])
I had a misconception: dataset.train_data is not affected by the specified transform, only the output of a DataLoader(dataset,...) will be. After checking data from
for data, _ in DataLoader(dataset):
break
we can see that ToTensor actually does exactly what is desired.

Invalid dimension for image data in plt.imshow()

I am using mnist dataset for training a capsule network in keras background.
After training, I want to display an image from mnist dataset. For loading images, mnist.load_data() is used. The data is stored as (x_train, y_train),(x_test, y_test).
Now, for visualizing image, my code is as follows:
img_path = x_test[1]
print(img_path.shape)
plt.imshow(img_path)
plt.show()
The code gives output as follows:
(28, 28, 1)
and the error on plt.imshow(img_path) as follows:
TypeError: Invalid dimensions for image data
How to show image in png format. Help!
As per the comment of #sdcbr using np.sqeeze reduces unnecessary dimension. If image is 2 dimensions then imshow function works fine. If image has 3 dimensions then you have to reduce extra 1 dimension. But, for higher dim data you will have to reduce it to 2 dims, so np.sqeeze may be applied multiple times. (Or you may use some other dim reduction functions for higher dim data)
import numpy as np
import matplotlib.pyplot as plt
img_path = x_test[1]
print(img_path.shape)
if(len(img_path.shape) == 3):
plt.imshow(np.squeeze(img_path))
elif(len(img_path.shape) == 2):
plt.imshow(img_path)
else:
print("Higher dimensional data")
Example:
plt.imshow(test_images[0])
TypeError: Invalid shape (28, 28, 1) for image data
Correction:
plt.imshow((tf.squeeze(test_images[0])))
Number 7
You can use tf.squeeze for removing dimensions of size 1 from the shape of a tensor.
plt.imshow( tf.shape( tf.squeeze(x_train) ) )
Check out TF2.0 example
matplotlib.pyplot.imshow() does not support images of shape (h, w, 1). Just remove the last dimension of the image by reshaping the image to (h, w): newimage = reshape(img,(h,w)).

How do I use Theanets LSTM RNN's on my time series data?

I have a simple dataframe consisting of one column. In that column are 10320 observations (numerical). I'm simulating Time-Series data by inserting the data into a plot with a window of 200 observations each. Here is the code for plotting.
import matplotlib.pyplot as plt
from IPython import display
fig_size = plt.rcParams["figure.figsize"]
import time
from matplotlib.backends.backend_agg import FigureCanvasAgg as FigureCanvas
fig, axes = plt.subplots(1,1, figsize=(19,5))
df = dframe.set_index(arange(0,len(dframe)))
std = dframe[0].std() * 6
window = 200
iterations = int(len(dframe)/window)
i = 0
dframe = dframe.set_index(arange(0,len(dframe)))
while i< iterations:
frm = window*i
if i == iterations:
to = len(dframe)
else:
to = frm+window
df = dframe[frm : to]
if len(df) > 100:
df = df.set_index(arange(0,len(df)))
plt.gca().cla()
plt.plot(df.index, df[0])
plt.axhline(y=std, xmin=0, xmax=len(df[0]),c='gray',linestyle='--',lw = 2, hold=None)
plt.axhline(y=-std , xmin=0, xmax=len(df[0]),c='gray',linestyle='--', lw = 2, hold=None)
plt.ylim(min(dframe[0])- 0.5 , max(dframe[0]) )
plt.xlim(-50,window+50)
display.clear_output(wait=True)
display.display(plt.gcf())
canvas = FigureCanvas(fig)
canvas.print_figure('fig.png', dpi=72, bbox_inches='tight')
i += 1
plt.close()
This simulates a flow of real-time data and visualizes it. What I want is to apply theanets RNN LSTM to the data to detect anomalies unsupervised. Because I am doing it unsupervised I don't think that I need to split my data into training and test sets. I haven't found much of anything that makes sense to me so far and have been googling for about 2 hours. Just hoping that you guys may be able to help. I want to put the prediction output of the RNN on the graph as well and define a threshold that, if the error is too large, the values will be identified as anomalous. If you need more information please comment and let me know. Thank you!
READING
Like neurons, LSTM networks are build of interconnected LSTM Blocks whose training is done via BackPropogation Through Time.
Classical anomaly detection using time series required prediction of time series output in future (at one or more points) and finding error on these points with true values. Prediction Error above a threshold will reflect and amomly
SOLUTION
Having said this
You've to train network so you need training sets and test sets both
Use N inputs to predict M outputs (decide upon N and M with experimentation - values for which training error is low)
Scroll a window of (N+M) elements in input data and use this data array of (N+M) items also termed as frame to train or test network.
Typically we use 90% of starting series for training and 10% for testing.
This scheme will fail as if training is not proper there will be false prediction errors which are not-anomaly. So make sure to provide enough training, and most important shuffle training frames and consider all variations.

Role of class_weight in loss functions for linearSVC and LogisticRegression

I am trying to figure out what exactly the loss function formula is and how I can manually calculate it when class_weight='auto' in case of svm.svc, svm.linearSVC and linear_model.LogisticRegression.
For balanced data, say you have a trained classifier: clf_c. Logistic loss should be (am I correct?):
def logistic_loss(x,y,w,b,b0):
'''
x: nxp data matrix where n is number of data points and p is number of features.
y: nx1 vector of true labels (-1 or 1).
w: nx1 vector of weights (vector of 1./n for balanced data).
b: px1 vector of feature weights.
b0: intercept.
'''
s = y
if 0 in np.unique(y):
print 'yes'
s = 2. * y - 1
l = np.dot(w, np.log(1 + np.exp(-s * (np.dot(x, np.squeeze(b)) + b0))))
return l
I realized that logisticRegression has predict_log_proba() which gives you exactly that when data is balanced:
b, b0 = clf_c.coef_, clf_c.intercept_
w = np.ones(len(y))/len(y)
-(clf_c.predict_log_proba(x[xrange(len(x)), np.floor((y+1)/2).astype(np.int8)]).mean() == logistic_loss(x,y,w,b,b0)
Note, np.floor((y+1)/2).astype(np.int8) simply maps y=(-1,1) to y=(0,1).
But this does not work when data is imbalanced.
What's more, you expect the classifier (here, logisticRegression) to perform similarly (in terms of loss function value) when data in balance and class_weight=None versus when data is imbalanced and class_weight='auto'. I need to have a way to calculate the loss function (without the regularization term) for both scenarios and compare them.
In short, what does class_weight = 'auto' exactly mean? Does it mean class_weight = {-1 : (y==1).sum()/(y==-1).sum() , 1 : 1.} or rather class_weight = {-1 : 1./(y==-1).sum() , 1 : 1./(y==1).sum()}?
Any help is much much appreciated. I tried going through the source code, but I am not a programmer and I am stuck.
Thanks a lot in advance.
class_weight heuristics
I am a bit puzzled by your first proposition for the class_weight='auto' heuristic, as:
class_weight = {-1 : (y == 1).sum() / (y == -1).sum(),
1 : 1.}
is the same as your second proposition if we normalize it so that the weights sum to one.
Anyway to understand what class_weight="auto" does, see this question:
what is the difference between class weight = none and auto in svm scikit learn.
I am copying it here for later comparison:
This means that each class you have (in classes) gets a weight equal
to 1 divided by the number of times that class appears in your data
(y), so classes that appear more often will get lower weights. This is
then further divided by the mean of all the inverse class frequencies.
Note how this is not completely obvious ;).
This heuristic is deprecated and will be removed in 0.18. It will be replaced by another heuristic, class_weight='balanced'.
The 'balanced' heuristic weighs classes proportionally to the inverse of their frequency.
From the docs:
The "balanced" mode uses the values of y to automatically adjust
weights inversely proportional to class frequencies in the input data:
n_samples / (n_classes * np.bincount(y)).
np.bincount(y) is an array with the element i being the count of class i samples.
Here's a bit of code to compare the two:
import numpy as np
from sklearn.datasets import make_classification
from sklearn.utils import compute_class_weight
n_classes = 3
n_samples = 1000
X, y = make_classification(n_samples=n_samples, n_features=20, n_informative=10,
n_classes=n_classes, weights=[0.05, 0.4, 0.55])
print("Count of samples per class: ", np.bincount(y))
balanced_weights = n_samples /(n_classes * np.bincount(y))
# Equivalent to the following, using version 0.17+:
# compute_class_weight("balanced", [0, 1, 2], y)
print("Balanced weights: ", balanced_weights)
print("'auto' weights: ", compute_class_weight("auto", [0, 1, 2], y))
Output:
Count of samples per class: [ 57 396 547]
Balanced weights: [ 5.84795322 0.84175084 0.60938452]
'auto' weights: [ 2.40356854 0.3459682 0.25046327]
The loss functions
Now the real question is: how are these weights used to train the classifier?
I don't have a thorough answer here unfortunately.
For SVC and linearSVC the docstring is pretty clear
Set the parameter C of class i to class_weight[i]*C for SVC.
So high weights mean less regularization for the class and a higher incentive for the svm to classify it properly.
I do not know how they work with logistic regression. I'll try to look into it but most of the code is in liblinear or libsvm and I'm not too familiar with those.
However, note that the weights in class_weight do not influence directly methods such as predict_proba. They change its ouput because the classifier optimizes a different loss function.
Not sure this is clear, so here's a snippet to explain what I mean (you need to run the first one for the imports and variable definition):
lr = LogisticRegression(class_weight="auto")
lr.fit(X, y)
# We get some probabilities...
print(lr.predict_proba(X))
new_lr = LogisticRegression(class_weight={0: 100, 1: 1, 2: 1})
new_lr.fit(X, y)
# We get different probabilities...
print(new_lr.predict_proba(X))
# Let's cheat a bit and hand-modify our new classifier.
new_lr.intercept_ = lr.intercept_.copy()
new_lr.coef_ = lr.coef_.copy()
# Now we get the SAME probabilities.
np.testing.assert_array_equal(new_lr.predict_proba(X), lr.predict_proba(X))
Hope this helps.

Resources