I have created a dataset for human activity recognition with accelerometer data (x, y, z) and HR (bpm). I have segmented the data into 2.5-second windows (20 Hz, i.e. 50 samples per window), but there is class imbalance that I would like to address with a method such as SMOTE. The problem is that I have not found a way to do this without corrupting the samples.
The shape is (Y, 50, 4), where Y is of arbitrary length.
That means that the resampled data also has to have shape (X, 50, 4), i.e. every new sample must keep the (50, 4) segment shape.
The dataset will be used in training a CNN-LSTM model.
I can't do the over- and under-sampling by using reshape(-1, 4) beforehand and then reshaping back to the original shape, since that would corrupt the segments of length 50.
Any idea how this can be done, preferably with existing libraries such as scikit-learn or imbalanced-learn?
Or is the better approach to use class_weights when training the model in Keras?
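For reference, here is a rough sketch of what flattening per window (rather than per timestep) would look like with imbalanced-learn's SMOTE, though I'm not sure whether interpolating whole windows is statistically sound for time series (dummy data, illustrative names):

import numpy as np
from imblearn.over_sampling import SMOTE

# Dummy stand-in for the real windows: (Y, 50, 4) accelerometer + HR segments.
X = np.random.randn(300, 50, 4)
y = np.r_[np.zeros(250, dtype=int), np.ones(50, dtype=int)]  # imbalanced labels

n_samples, n_steps, n_channels = X.shape
X_flat = X.reshape(n_samples, n_steps * n_channels)  # (Y, 200): one row per whole window

X_res, y_res = SMOTE().fit_resample(X_flat, y)       # synthetic minority windows
X_res = X_res.reshape(-1, n_steps, n_channels)       # back to (Y', 50, 4), segments intact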
I have 1D data (one column of data). I used a Gaussian Mixture Model (GMM) for density estimation, using this implementation in Python: https://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html. By relying on the AIC/BIC criterion I was able to determine the number of components. After I fit the GMM, I plotted the kernel density estimate of the original observations together with that of data sampled from the GMM. The plots of the original and sampled densities are quite similar (that is good). But I would like some metric to report how good the fitted model is.
from sklearn.mixture import GaussianMixture

g = GaussianMixture(n_components=35)
data = df['x'].values.reshape(-1, 1)  # data taken from the data frame (10,000 data points)
clf = g.fit(data)                     # fit the model
samples = clf.sample(10000)[0]        # generate sample points (same number as original data points)
I found score in the implementation, but I am not sure how to interpret it. Am I doing it wrong? Or is there any better way to show how accurate the fitted model is, apart from histograms or kernel density plots?
print(clf.score(data))     # average log-likelihood of the original data
print(clf.score(samples))  # average log-likelihood of the sampled data
You can use normalized_mutual_info_score, adjusted_rand_score or silhouette_score to evaluate your clusters. All of these metrics are implemented in sklearn.metrics.
EDIT: You can check this link for more detailed explanations.
In summary:
Adjusted Rand Index: measures the similarity of the two assignments.
Normalized Mutual Information: measures the agreement of the two assignments.
Silhouette Coefficient: measures how well-assigned each individual point is.
from sklearn.metrics import silhouette_score

gmm.fit(x_vec)
pred = gmm.predict(x_vec)
print("gmm silhouette:", silhouette_score(x_vec, pred))
I would rather use cross-validation and check the accuracy of the trained model.
Use the predict method of the fitted model to predict the labels of unseen data (use cross-validation and report the accuracy): https://scikit-learn.org/stable/modules/generated/sklearn.mixture.GaussianMixture.html#sklearn.mixture.GaussianMixture.predict
Toy example:
g = GaussianMixture(n_components=35)
g.fit(train_data)              # fit the model on the training split
y_pred = g.predict(test_data)  # predict component labels for unseen data
EDIT:
There are several options to measure performance in the unsupervised case. For a GMM, which is based on actual probabilities, the most common are BIC and AIC. They are directly available in scikit-learn's GaussianMixture class.
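For example, a quick sketch of comparing BIC/AIC across a few component counts (synthetic data here, just to illustrate the calls):

import numpy as np
from sklearn.mixture import GaussianMixture

data = np.random.randn(10000, 1)  # stand-in for df['x'].values.reshape(-1, 1)

for k in (10, 20, 35):
    g = GaussianMixture(n_components=k).fit(data)
    # Lower BIC/AIC means a better trade-off between fit and model complexity.
    print(k, g.bic(data), g.aic(data))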
I am training a CNN model (made using Keras). The input image data has around 10,200 images. There are 120 classes to be classified. Plotting the data frequency, I can see that the sample count for every class is more or less uniform.
The problem I am facing is that the loss for the training data goes down with epochs, but for the validation data it first falls and then keeps increasing. The accuracy plot reflects this: accuracy for the training data finally settles at 0.94, but for the validation data it is around 0.08.
Basically it's a case of overfitting.
I am using a learning rate of 0.005 and dropout of 0.25.
What measures can I take to get better validation accuracy? Is it possible that the sample size for each class is too small and I need data augmentation to get more data points?
Hard to say what the reason could be. First, you can try classical regularization techniques like reducing the size of your model, adding dropout, or adding l2/l1 regularizers to the layers. But this is more like randomly guessing the model's hyperparameters and hoping for the best.
The scientific approach would be to look at your model's outputs, try to understand why it produces them, and check your pipeline. Did you have a look at the outputs (are they all the same)? Did you preprocess the validation data the same way as the training data? Did you make a stratified train/test split, i.e. keeping the class distribution the same in both sets? Is the data shuffled when you feed it to your model?
In the end you have only about ~85 images per class, which is really not a lot; compare that to CIFAR-10 / CIFAR-100 with 6000 / 600 images per class, or ImageNet with 20k classes and 14M images (~500 images per class). So data augmentation could be beneficial as well.
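If you go the augmentation route, a minimal sketch with Keras' ImageDataGenerator could look like this (parameter values are illustrative, not tuned for your data):

from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Illustrative augmentation settings; tune the ranges for your images.
datagen = ImageDataGenerator(
    rotation_range=15,       # small random rotations
    width_shift_range=0.1,   # horizontal shifts (fraction of width)
    height_shift_range=0.1,  # vertical shifts (fraction of height)
    zoom_range=0.1,
    horizontal_flip=True,
)

# x_train: (N, H, W, C) images, y_train: one-hot labels (placeholders for your data).
# model.fit(datagen.flow(x_train, y_train, batch_size=32), epochs=50,
#           validation_data=(x_val, y_val))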
I have a sequence of multi-band images; each sample is a tensor of size (50, 6, 30, 30), where 50 is the number of image frames in the sequence, 6 is the number of bands per pixel, and 30x30 is the spatial dimension of the image. The ground-truth map is of size 30x30, but it is one-hot encoded (to use cross-entropy loss) over 7 classes, so it is a tensor of size (1, 7, 30, 30). I want to use a combination of convolutional and LSTM layers (or an integrated ConvLSTM2D layer) for my classification task, but there are the following problems:
1- Not every point has a valid label in the output map (i.e. some one-hot vectors are all zeros).
2- Not every pixel has a valid value at every timestamp. So, at any given timestamp, some of the pixels may have zero values (meaning invalid) for all of their bands.
I have read many Q&As on how to handle this issue, and I think I should use the sample_weights option to mask the invalid points and classes, but I am really uncertain how to do it. Sample weights would have to be applied to every pixel and every timestamp independently. I think I could manage it if I didn't have the convolution part (a purely 2D approach), but I don't understand how it works when convolution is in place, because some pixel values in the convolution window are valid and some are invalid. If I mask those invalid pixels at a specific time (and I still don't know how to do that), what will happen to the chain of forward and backward propagation and loss calculation? I think it will be ruined!
Looking for comments and help.
Possible solution:
Problem 1: For pixels that do not have a class at all, you can introduce a new class, e.g. "noise". That means your one-hot encoding has a slot for that class as well, and weights will be generated accordingly for those pixels in the noise class. This is an indirect way of achieving the same thing you do with sample weights, because with the sample_weight technique you tell Keras or sklearn what the weight of each sample is.
Problem 2: Consider the possible use cases: for these invalid values, will a class value be present in the one-hot vector, or will it be all zeros? You can also preprocess these and add them to the noise class as well; then point 2 is handled by point 1 automatically (see the sketch below).
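A rough numpy sketch of folding the all-zero one-hot vectors into an extra "noise" channel (shapes follow the question; this is only an illustration, not tested on your data):

import numpy as np

# y: one-hot ground truth of shape (N, 7, 30, 30); all-zero vectors mean "no label".
y = np.zeros((4, 7, 30, 30), dtype=np.float32)  # dummy stand-in for the real masks

# Append an 8th "noise" channel that is 1 wherever no class is set.
no_label = (y.sum(axis=1, keepdims=True) == 0).astype(np.float32)  # (N, 1, 30, 30)
y_with_noise = np.concatenate([y, no_label], axis=1)               # (N, 8, 30, 30)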
I am trying to build a CNN (in Keras) that can estimate the rotation of an image (or a 2d object). So basically, the input is an image and the output should be its rotation.
My first experiment is to estimate the rotation of MNIST digits (starting with only one digit "class", let's say the "3"). So what I did was extract all 3s from the MNIST set and build a "rotated 3s" dataset by randomly rotating these images multiple times and storing the rotated images together with their rotation angles as ground-truth labels.
My first problem was that a 2D rotation is cyclic and I didn't know how to model this behavior. Therefore, I encoded the angle as y = sin(ang), x = cos(ang). This gives me my dataset (the rotated 3s images) and the corresponding labels (x and y values).
For the CNN, as a start, I just took the Keras MNIST CNN example (https://keras.io/examples/mnist_cnn/) and replaced the last dense layer (which had 10 outputs and a softmax activation) with a dense layer that has 2 outputs (x and y) and a tanh activation (since y = sin(ang), x = cos(ang) are within [-1, 1]).
The last thing I had to decide was the loss function, where I basically want a distance measure for angles. Therefore I thought "cosine_proximity" was the way to go.
When training the network I can see that the loss decreases and converges to a certain point. However, when I then check the predictions against the ground truth, I observe (for me) fairly surprising behavior: almost all x and y predictions tend towards 0 or +/-1, and since the "decoding" of my rotation is ang = atan2(y, x), the predictions are usually either +/-0°, 45°, 90°, 135° or 180°.
However, my training and test data contain only angles of 0°, 20°, 40°, ..., 360°.
This doesn't really change if I change the complexity of the network. I also played around with the optimizer parameters without any success.
Is there anything wrong with the assumptions:
- x,y encoding for angle
- tanh activation to have values in [-1,1]
- cosine_proximity as loss function
Thanks in advance for any advice, tips, or for pointing me towards a possible mistake I made!
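For reference, my angle encoding/decoding boils down to this (numpy sketch):

import numpy as np

def encode_angle(ang_deg):
    # Map an angle in degrees to (x, y) = (cos, sin) so the target is cyclic.
    rad = np.deg2rad(ang_deg)
    return np.cos(rad), np.sin(rad)

def decode_angle(x, y):
    # Recover the angle in degrees from the network's (x, y) output.
    return np.rad2deg(np.arctan2(y, x)) % 360

x, y = encode_angle(40)
print(decode_angle(x, y))  # ~40.0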
It's hard to give you an exact answer so let's try with some ideas:
Change from cosine proximity to MSE or other losses and check if something changes.
Change the way you encode the target. You could just represent the angle as a number between 0 and 1. It doesn't seem to be a problem even if the angles are cyclic.
Ensure your preprocessing/augmentation steps make sense for this particular task.
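For the first two points, a minimal sketch of what that could look like in Keras, assuming a single sigmoid output predicting angle/360 and MSE loss (layer sizes are placeholders, not a tuned architecture):

from tensorflow.keras import layers, models

# Placeholder architecture, not tuned; the point is the output/loss setup.
model = models.Sequential([
    layers.Conv2D(32, 3, activation='relu', input_shape=(28, 28, 1)),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(64, activation='relu'),
    layers.Dense(1, activation='sigmoid'),  # predicts angle / 360 in [0, 1]
])
model.compile(optimizer='adam', loss='mse')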
Using Keras for image segmentation on a highly imbalanced dataset, I want to re-weight the classes proportionally to the number of pixels in each class, as described here. If I have binary classes with weights = [0.8, 0.2], how can I modify K.sparse_categorical_crossentropy(y_true, y_pred) to re-weight the loss according to the class the pixel belongs to?
The input has shape (4, 256, 256, 1) (batch, height, width, channels) and the output is a vector of 0s and 1s of shape (4, 65536, 1) (positive and negative class). The model and data are similar to the ones here, with the difference being that the images are grayscale and the masks are binary (2 classes).
This is the custom loss function I used for my semantic segmentation project. It is modified from the categorical_crossentropy function found in keras/backend/tensorflow_backend.py.
import tensorflow as tf

def class_weighted_pixelwise_crossentropy(target, output):
    # Clip predictions away from 0 and 1 so the log stays finite.
    output = tf.clip_by_value(output, 10e-8, 1. - 10e-8)
    # Per-class weights, broadcast over the last (class) axis.
    weights = [0.8, 0.2]
    return -tf.reduce_sum(target * weights * tf.math.log(output))
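Usage would be like any custom Keras loss, e.g. (assuming your compiled model variable is called model):

# Assumed usage (not from the original answer): pass the function to compile().
model.compile(optimizer='adam', loss=class_weighted_pixelwise_crossentropy)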
Note that my final version did not use class weighting - I found that it encouraged the model to use the underrepresented classes as filler for patches of the image that it was unsure about instead of making more realistic guesses, and thereby hurt performance.
Jessica's answer is clean and works well. I generally recommend it. But for the sake of variety:
I have found that sampling regions of interest that include a better ratio between the classes is an effective way to quickly learn skewed pixelwise classes.
In my case, I had two classes like you, which makes things easier. I look for areas in the image that contain the less-represented class, and crop a constant-sized bounding box around them with some random offset (I repeat the process multiple times per image). This yields a large set of small images that have fairly equal ratios of each class.
I should probably add that the network will have to be set to an input shape of (None, None, num_channels) for this to then work on your original images.
Because you skip the vast majority of pixels (those belonging to the majority class), training is very fast, but it doesn't leverage all of the data for the majority class.
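A rough numpy sketch of the cropping idea (names and sizes are illustrative, and it assumes the image is larger than the crop):

import numpy as np

def sample_minority_crops(image, mask, crop=64, n_crops=8, minority_class=1, max_offset=16):
    # Find pixels of the under-represented class, jitter around them, and cut fixed-size patches.
    ys, xs = np.where(mask == minority_class)
    h, w = mask.shape
    patches = []
    for _ in range(min(n_crops, len(ys))):
        i = np.random.randint(len(ys))
        cy = int(np.clip(ys[i] + np.random.randint(-max_offset, max_offset + 1), crop // 2, h - crop // 2))
        cx = int(np.clip(xs[i] + np.random.randint(-max_offset, max_offset + 1), crop // 2, w - crop // 2))
        patches.append((image[cy - crop // 2:cy + crop // 2, cx - crop // 2:cx + crop // 2],
                        mask[cy - crop // 2:cy + crop // 2, cx - crop // 2:cx + crop // 2]))
    return patches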
In TensorFlow 2.x the model.fit method has a class_weight argument to do this natively, by passing a dictionary of weights for each class. Documentation
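For illustration (model, x_train and y_train stand in for your own; the weight values are the ones from the question):

# class_weight maps class index -> weight.
model.fit(x_train, y_train, epochs=10, class_weight={0: 0.8, 1: 0.2})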