Probability Distribution of batch in keras - keras

I am trying to train a CNN model on imbalanced dataset. I wanted to know how well a batch approximates the distribution in the training dataset. Is there any parameter in an inbuilt function in keras which could be specified to maintain the same distribution in batches?

It's possible to train and get good results depending on how severe the imbalance is.
But yes, there are easy ways to compensate this, such as using sample_weight and class_weight in the fit method.
From the documentation on the fit method:
class_weight: Optional dictionary mapping class indices (integers) to a weight (float) value, used for weighting the loss function (during training only). This can be useful to tell the model to "pay more attention" to samples from an under-represented class.
sample_weight: Optional Numpy array of weights for the training samples, used for weighting the loss function (during training only). You can either pass a flat (1D) Numpy array with the same length as the input samples (1:1 mapping between weights and samples), or in the case of temporal data, you can pass a 2D array with shape (samples, sequence_length), to apply a different weight to every timestep of every sample. In this case you should make sure to specify sample_weight_mode="temporal" in compile().
So, you can compensate three kinds of imbalance:
Class imbalance: when you have classes (results) that are more present than others
Sample imbalance: when some of the inputs are more important than others
Temporal (or 2D) imbalance: when some steps in a sequence are more important than others

Related

Large dataset - ANN

I am trying to classify around 400K data with 13 attributes. I have used python sklearn's SVM package, but it didn't work, and then I learned that SVM's are not suitable for large dataset classification. Then I used the (sklearn) ANN using the following MLPClassifier:
MLPClassifier(solver='adam', alpha=1e-5, random_state=1,activation='relu', max_iter=500)
and trained the system using 200K samples, and tested the model on the remaining ones. The classification worked well. However, my concern is that the system is over trained or overfit. Can you please guide me on the number of hidden layers and node sizes to make sure that there is no overfit? (I have learned that the default implementation has 100 hidden neurons. Is it ok to use the default implementation as is?)
To know if your are overfitting you have to compute:
Training set accuracy
Test set accuracy
Once you have calculated this scores, compare it. If training set score is much better than your test set score, then you are overfitting. This means that your model is "memorizing" your data, instead of learning from it to make future predictions.
If you are overfitting with Neuronal Networks you probably have to reduce the number of layers and reduce the number of neurons per layer. There isn't any strict rule that says the number of layer or neurons you need depending on you dataset size. Every dataset can behaves completely different with the same dataset size.
So, to conclude, if you are overfitting, you would have to evaluate your model accuracy using different parameters of layers and number of neurons, and, then, observe with which values you obtain the best results. There are some methods you can use to find the best parameters, is like gridsearchCV.

How to calculate a One-Hot Encoding value into a real-valued vector?

In Word2Vec, i've learned that both of CBOW and Skip-gram produce a one-hot encoding value to create a vector (cmiiw), I wonder how to calculate or represents a One-Hot Encoding value into a real-valued vector, for example (source: DistrictDataLab's Blog about Distributed Representations)
from this:
into:
please help, I was struggling on finding this information.
The word2vec algorithm itself is what incrementally learns the real-valued vector, with varied dimension values.
In contrast to the one-hot encoding, these vectors are often called "dense embeddings". They're "dense" because unlike the one-hot encoding, which is "sparse" with many dimensions and mostly zero values, they have fewer dimensions and (usually) no zero-values. They're an "embedding" because they've "embed" a discrete set-of-words into another continuous-coordinate-system.
You'd want to read the original word2vec paper for a full formal description of how the dense embeddings are made.
But the gist is that the dense vectors start totally random, and so at first the algorithm's internal neural network is useless for predicting neighboring words. But each (context)->(target) word training example from a text corpus is tried against the network, and each time the difference from the desired prediction is used to apply a tiny nudge, towards a better prediction, to both word-vector and internal-network-weight values.
Repeated many times, initially with larger nudges (higher learning-rate) then with ever-smaller nudges, the dense vectors rearrange their coordinates from their initial randomness to a useful relative-arrangement – one that's about-as-good as possible for predicting the training text, given the limits of the model itself. (That is, any further nudge that improves predictions on some examples, worsens it on others – so you might as well consider training done.)
You then read the resulting dense embedding real-valued vectors out of the model, and use them for purposes other than just nearby-word prediction.

Re-weight the input to a neural network

For example, I feed a set of images into a CNN. And the default weight of these images is 1. How can I re-weight some of these images so that they have different weights? Can 'DataLoader' achieve this goal in pytorch?
I learned two other possibilities:
Defining a custom loss function, providing weights for each sample as I require.
Repeating samples in the training set, which will result in more frequent samples having a higher weight in the final loss.
Is there any other way, we can achieve that? Any suggestion would be appreciated.
I can think of two ways to achieve this.
Pass on the weight explicitly, when you backpropagate the gradients.
After you computed loss, and when you're about to backpropagate, you can pass a Tensor to backward() and all the subsequent gradients will be scaled by the corresponding element, i.e. do something like
loss = torch.pow(out-target,2)
loss.backward(my_weights) # is a Tensor of same dimension as loss
However, if you use want to assign individual weights in a batch, you can't use the custom loss functions from the nn.module which aggregates the loss over all samples in a batch.
Use torch.utils.data.sampler.WeightedRandomSampler
If you use PyTorch's data.utils anyway, this is simpler than multiplying your training set. However it doesn't assign exact weights, since it's stochastic. But if you're iterating over your training set a sufficient number of times, it's probably close enough.

Is there class weight (or alternative way) for GradientBoostingClassifier in Sklearn when dealing with VotingClassifier or Grid search?

I'm using GradientBoostingClassifier for my unbalanced labeled datasets. It seems like class weight doesn't exist as a parameter for this classifier in Sklearn. I see I can use sample_weight when fit but I cannot use it when I deal with VotingClassifier or GridSearch. Could someone help?
Currently there isn't a way to use class_weights for GB in sklearn.
Don't confuse this with sample_weight
Sample Weights change the loss function and your score that you're trying to optimize. This is often used in case of survey data where sampling approaches have gaps.
Class Weights are used to correct class imbalances as a proxy for over \ undersampling. There is no direct way to do that for GB in sklearn (you can do that in Random Forests though)
Very late, but I hope it can be useful for other members.
In the article of Zichen Wang in towardsdatascience.com, the point 5 Gradient Boosting it is told:
For instance, Gradient Boosting Machines (GBM) deals with class imbalance by constructing successive training sets based on incorrectly classified examples. It usually outperforms Random Forest on imbalanced dataset For instance, Gradient Boosting Machines (GBM) deals with class imbalance by constructing successive training sets based on incorrectly classified examples. It usually outperforms Random Forest on imbalanced dataset.
And a chart shows that the half of the grandient boosting model have an AUROC over 80%. So considering GB models performances and the way they are done, it seems not to be necessary to introduce a kind of class_weight parameter as it is the case for RandomForestClassifier in sklearn package.
In the book Introduction To Machine Learning with Pyhton written by Andreas C. Müller and Sarah Guido, edition 2017, page 89, Chapter 2 *Supervised Learning, section Ensembles of Decision Trees, sub-section Gradient boosted regression trees (gradient boosting machines):
They are generally a bit more sensitive to
parameter settings than random forests, but can provide better accuracy if the parameters are set correctly.
Now if you still have scoring problems due to imbalance proportions of categories in the target variable, it is possible you should see if your data should be splited to apply different models on it, because they are not as homogeneous as it seems to be. I mean it may have a variable you have not in your dataset train (an hidden variable clearly) that influences a lot the model results, then it is difficult even for the greater GB to give correct scoring because it misses a huge information that you cannot make appear in the matrix to compute sometimes for many reasons.
Some updates:
I found, by random, there are libraries that implement it as parameters of their gradient boosting instance objects. It is the case of H2O where for the parameter balance_classes it is told:
Balance training data class counts via over/under-sampling (for
imbalanced data).
Type: bool (default: False).
If you want to keep with sklearn you should do as HakunaMaData told: over/under-sampling because that's what other libraries finally do when the parameter exist.

predict() returns image similarities with SVM in scikit learn

A silly question: after i train my SVM in scikit-learn i have to use predict function: predict(X) for predicting at which class belongs? (http://scikit-learn.org/dev/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC.predict)
X parameter is the image feature vector?
In case i give an image not trained (not trained because SVM ask at least 3 samples for class), what returns?
First remark: "predict() returns image similarities with SVM in scikit learn" is not a question. Please put a question in the header of Stack Overflow entries.
Second remark: the predict method of the SVC class in sklearn does not return "image similarities" but a class assignment prediction. Read the http://scikit-learn.org documentation and tutorials to understand what we mean by classification and prediction in machine learning.
X parameter is the image feature vector?
No, X is not "the image" feature vector: it is a set of image feature vectors with shape (n_samples, n_features) as explained in the documentation you refer to. In your case a sample is an image hence the expected shape would be (n_images, n_features). The predict API was design to compute many predictions at once for efficiency reason. If you want to compute a single prediction, you will have to wrap your single feature vector in an array with shape (1, n_features).
For instance if you have a single feature vector (1D) called my_single_image_features with shape (n_features,) you can call predict with:
predictions = clf.predict([my_single_image_features])
my_single_prediction = predictions[0]
Please note the [] signs around the my_single_image_features variable to turn it into a 2D array.
my_single_prediction will be an integer whose meaning depends on the integer values provided by you when calling the clf.fit(X_train, y_train) method in the first place.
In case i give an image not trained (not trained because SVM ask at least 3 samples for class), what returns?
An image is not "trained". Only the model is trained. Of course you can pass samples / images that are not part of the training set to the predict method. This is the whole purpose of machine learning: making predictions on new unseen data based on what you learn from the statistical regularities seen in the past training data.

Resources