Sklearn Random Forest different accuracy values for different label encodings - scikit-learn

I'm using sklearn's Random Forest to train my model. With the same input features, I first passed the target labels through label_binarize to create one-hot encodings, and then tried LabelEncoder to encode them as integers. The two cases give different accuracy scores. Is there a specific reason why this happens, given that I'm only changing how the labels are encoded and not the input features?

It is not because of the labels, but because of the randomness of Random Forest.
Try fixing random_state to avoid this situation.

https://datascience.stackexchange.com/questions/74364/random-forrest-sklearn-gives-different-accuracy-for-different-target-label-encod
Basically, when you encode your target labels as one-hot vectors, sklearn treats the task as a multilabel problem, whereas LabelEncoder gives a 1-D array, which sklearn treats as a multiclass problem.
https://scikit-learn.org/stable/modules/multiclass.html
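A minimal sketch of the difference described above, assuming the iris dataset as a stand-in for the asker's data (the dataset, split, and n_estimators are illustrative choices, not from the question). random_state is fixed in both fits so that the encoding is the only thing that changes:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, label_binarize
from sklearn.utils.multiclass import type_of_target

# Stand-in data (assumption): three-class iris labels.
X, y = load_iris(return_X_y=True)

# Two encodings of the same target.
y_int = LabelEncoder().fit_transform(y)           # shape (n_samples,)
y_hot = label_binarize(y, classes=np.unique(y))   # shape (n_samples, 3)

# How scikit-learn interprets each target.
kind_int = type_of_target(y_int)
kind_hot = type_of_target(y_hot)
print(kind_int, kind_hot)

X_tr, X_te, yi_tr, yi_te, yh_tr, yh_te = train_test_split(
    X, y_int, y_hot, test_size=0.3, random_state=0
)

# Same seed for both forests, so randomness is not the difference.
rf_int = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, yi_tr)
rf_hot = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, yh_tr)

# For the one-hot target, score() is subset accuracy: a row counts as
# correct only if all three indicator columns are predicted correctly.
acc_int = rf_int.score(X_te, yi_te)
acc_hot = rf_hot.score(X_te, yh_te)
pred_hot = rf_hot.predict(X_te)
print(acc_int, acc_hot, pred_hot.shape)
```

With the one-hot target, a predicted row may contain no 1 or more than one 1, which is one way the two accuracy values can diverge even with identical features and seed.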

Related

Tensorflow Keras: Problems to handle variable length input, using generator?

We want to train our model on inputs of varying dimensions. Every input in a given batch, and across batches, has different dimensions.
We cannot resize our inputs (since we would lose microscopic features), so converting them into batches of NumPy arrays becomes impossible. To handle this, I store the inputs in a list in which each element has shape (height, width, 1). Height is variable and width is constant.
Sometimes my input is excessively large. To handle that, I plan to use model.fit_generator(). In the generator, we find the maximum height and width of the inputs in a batch and pad every other input with zeros so that every input in the batch has the same dimensions. We can then easily convert the batch to a NumPy array or a tensor and pass it to fit_generator(). The model automatically learns to ignore the zeros and learns features from the intended portion of the padded input. This way, each batch has equal input dimensions, but different batches have different shapes (because the maximum height and width differ across batches).
So far, I have described what I have learned and what I plan to do with variable-size input data. But I am stuck on the following points:
1- I plan to use a CNN first and then an LSTM on top of it. I am using TensorFlow Keras, which provides padding and masking. As far as I know, an LSTM can work with masking and ignore zero-padded values. However, I am concerned about the CNN (does a CNN ignore zero-padded values?), because my padded input is fed to the CNN first. I have seen some discussion at the following links:
How to apply masking layer to sequential CNN model in Keras?
https://github.com/keras-team/keras/issues/411
These links mention that, unfortunately, masking is not yet supported by the Keras Conv layers. However, there has been a lot of development since then, specifically in TensorFlow Keras. So I am wondering whether TensorFlow Keras now supports masking for this kind of input.
2- To use a generator, we can write a custom Keras generator. I went through a very good tutorial on this and decided to use it. But I am wondering whether TensorFlow Keras has any built-in facility for generators that would save me from writing a custom one.
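The per-batch zero padding described in the question can be sketched with plain NumPy and a Python generator. The names pad_batch and batch_generator are hypothetical helpers, and the toy shapes are assumptions. In recent TensorFlow releases, Model.fit also accepts a generator like this directly, so a custom class may not be needed:

```python
import numpy as np

def pad_batch(samples):
    """Zero-pad a list of (height, width, channels) arrays to a common shape."""
    max_h = max(s.shape[0] for s in samples)
    max_w = max(s.shape[1] for s in samples)
    channels = samples[0].shape[2]
    batch = np.zeros((len(samples), max_h, max_w, channels), dtype=np.float32)
    for i, s in enumerate(samples):
        h, w = s.shape[:2]
        batch[i, :h, :w, :] = s  # original data goes in the top-left corner
    return batch

def batch_generator(samples, labels, batch_size):
    """Yield (padded_inputs, labels) forever; each batch has its own max shape."""
    while True:
        for start in range(0, len(samples), batch_size):
            xs = samples[start:start + batch_size]
            ys = labels[start:start + batch_size]
            yield pad_batch(xs), np.asarray(ys)

# Toy data (assumption): variable height, constant width of 4, one channel.
toy_samples = [np.ones((h, 4, 1), dtype=np.float32) for h in (3, 5, 2)]
toy_labels = [0, 1, 0]

gen = batch_generator(toy_samples, toy_labels, batch_size=2)
first_x, first_y = next(gen)
second_x, second_y = next(gen)
print(first_x.shape, second_x.shape)
```

Because every batch can have a different height, the model's input layer would need a variable spatial dimension (for example, a shape with None for the height).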

What steps should I take next to improve my accuracy? Can data be the problem?

I built various ML models using sklearn for a binary classification problem. The data-set is provided to me by my professor for this comparative study.
My Jupyter notebook and dataset can be found here.
As I am getting very low accuracy, I fear that I must be doing something wrong while building the model. So I tested my decision tree on a built-in sklearn dataset (the breast cancer dataset), which is similar to my dataset in that both are binary classification problems. There I get a mean accuracy of 95%. So I now think the problem might be my dataset. Can I get some help on how to pre-process my data, or on any other steps I might look into to improve accuracy?
Encode labels
Categorical data are variables that contain label values rather than numeric values. The number of possible values is often limited to a fixed set.
For example, users are typically described by country, gender, age group, etc. We will use LabelEncoder to encode the categorical data. LabelEncoder is part of the scikit-learn library in Python and is used to convert categorical or text data into numbers, which our predictive models can handle better.
#Encoding categorical data values
from sklearn.preprocessing import LabelEncoder
labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(Y)
Feature scaling
Most of the time, your dataset will contain features that vary widely in magnitude, units, and range. Since many machine learning algorithms use the Euclidean distance between two data points in their computations, we need to bring all features to the same level of magnitude. This can be achieved by scaling, which means transforming your data so that it fits a specific scale, like 0–100 or 0–1. We will use the StandardScaler class from the scikit-learn library, which standardizes each feature to zero mean and unit variance.
# Feature scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
Choosing Right model
You might also want to choose an appropriate model. You can't just use neural nets for every problem; that is the no free lunch theorem. To compare candidate models you could use K-fold cross-validation, AIC, and BIC.
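As a rough sketch of the K-fold comparison suggested above, the snippet below scores two candidate models with 5-fold cross-validation on sklearn's built-in breast cancer dataset (the dataset the asker mentions testing on). The choice of candidates, cv=5, and max_iter are assumptions for illustration only:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Candidate models (assumption). The scaler sits inside the pipeline so it is
# re-fitted on the training part of every fold and never sees the held-out part.
candidates = {
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
}

# One array of 5 accuracy values per model.
cv_scores = {
    name: cross_val_score(model, X, y, cv=5)
    for name, model in candidates.items()
}

for name, scores in cv_scores.items():
    print(f"{name}: mean accuracy {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Comparing the mean and spread across folds gives a steadier picture than a single train/test split, which matters when deciding whether low accuracy comes from the model or from the data.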

How to ignore some input layer, while predicting, in a keras model trained with multiple input layers?

I'm working with neural networks and I've implemented the following architecture using keras with tensorflow backend:
For training, I give some labels in the layer labels_vector; this vector can have int32 values (i.e., 0 could be a label). For the testing phase, I need to just ignore this input layer. If I set it to 0, the results could be wrong, since I've trained with labels that can be equal to the 0 vector. Is there a way to simply ignore or disable this layer in the prediction phase?
Thanks in advance.
How to ignore some input layer?
You can't. Keras cannot simply ignore an input layer, because the output depends on it.
One solution that gets you nearly what you want is to define a custom label in your training data to serve as the null value. Your network will learn to ignore it if it finds that it is not an important feature.
If labels_vector is a vector of categorical labels, use one-hot encoding instead of integer encoding. Integer encoding assumes a natural ordering between the labels, which is wrong here.
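A small NumPy sketch of combining the two suggestions above: reserve one extra index as the null label and one-hot encode everything. NUM_CLASSES, NULL_LABEL, and one_hot_with_null are hypothetical names, and the number of real classes is an assumption:

```python
import numpy as np

NUM_CLASSES = 4            # assumed number of real labels: 0, 1, 2, 3
NULL_LABEL = NUM_CLASSES   # hypothetical extra index meaning "no label given"

def one_hot_with_null(labels):
    """One-hot encode integer labels, with one extra column for the null label."""
    labels = np.asarray(labels, dtype=np.int64)
    # Row i of the identity matrix is the one-hot vector for label i.
    return np.eye(NUM_CLASSES + 1, dtype=np.float32)[labels]

# Training: mix real labels with some null labels so the network sees both.
train_labels = one_hot_with_null([0, 3, NULL_LABEL, 1])

# Prediction: feed the null label for every sample.
predict_labels = one_hot_with_null([NULL_LABEL, NULL_LABEL])

print(train_labels)
print(predict_labels)
```

With this encoding, label 0 and "no label" are distinct vectors, which avoids the ambiguity the question raises about passing 0 at prediction time.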

Scikit classification on categorical variables - feature importance and hot encoding - which first?

I have a dataframe comprising 23 categorical variables. I would eventually like to build a predictive model (decision tree/random forest) to predict whether someone will attend an interview or not. This is my target variable. I will use scikit-learn for this task.
Questions:
As these are categorical variables, am I right in saying I need to one-hot encode each of my 23 categorical variables before splitting into train, test, and validation sets?
I have also been told to use feature importance, but I am unsure whether to use it before or after one-hot encoding. It was my understanding that feature importance would help reduce the number of features I have to one-hot encode; in other words, I would use feature importance before one-hot encoding.
However, the RandomForestClassifier() I attempted to use for feature importance will not work with strings:
Input: forest = RandomForestClassifier(n_estimators=250, random_state=0)
forest.fit(X, y)
Output: ValueError: could not convert string to float: 'Single'
What would be the best way to go about this please?
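One possible way around the ValueError above, sketched under assumptions: one-hot encode first with pandas.get_dummies, fit the forest on the encoded columns, and then sum the importances of the dummy columns back to their original variable. The toy dataframe, its column names, and the "=" separator are made up for illustration:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in (assumption) for the 23-column dataframe in the question.
df = pd.DataFrame({
    "marital_status": ["Single", "Married", "Single", "Married", "Single", "Married"],
    "location": ["A", "B", "A", "C", "B", "C"],
})
y = [1, 0, 1, 0, 1, 0]

# One-hot encode; "=" keeps the original column name easy to recover later.
X = pd.get_dummies(df, prefix_sep="=")

forest = RandomForestClassifier(n_estimators=250, random_state=0)
forest.fit(X, y)

# Importance per dummy column, then summed per original variable.
importances = pd.Series(forest.feature_importances_, index=X.columns)
per_variable = importances.groupby(lambda name: name.split("=")[0]).sum()
print(per_variable.sort_values(ascending=False))
```

In this sketch the encoding comes before the importance step, since the forest cannot be fitted on raw strings; the per-variable totals can then guide which original columns to keep.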

Probability Distribution of batch in keras

I am trying to train a CNN model on an imbalanced dataset. I wanted to know how well a batch approximates the distribution of the training dataset. Is there any parameter of a built-in Keras function that can be specified to maintain the same distribution in batches?
It's possible to train and get good results, depending on how severe the imbalance is.
But yes, there are easy ways to compensate for this, such as using sample_weight and class_weight in the fit method.
From the documentation on the fit method:
class_weight: Optional dictionary mapping class indices (integers) to a weight (float) value, used for weighting the loss function (during training only). This can be useful to tell the model to "pay more attention" to samples from an under-represented class.
sample_weight: Optional Numpy array of weights for the training samples, used for weighting the loss function (during training only). You can either pass a flat (1D) Numpy array with the same length as the input samples (1:1 mapping between weights and samples), or in the case of temporal data, you can pass a 2D array with shape (samples, sequence_length), to apply a different weight to every timestep of every sample. In this case you should make sure to specify sample_weight_mode="temporal" in compile().
So, you can compensate for three kinds of imbalance:
Class imbalance: when you have classes (results) that are more present than others
Sample imbalance: when some of the inputs are more important than others
Temporal (or 2D) imbalance: when some steps in a sequence are more important than others
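For the class imbalance case, one way to build the class_weight dictionary is scikit-learn's compute_class_weight helper. This is only a sketch: the 90/10 label split is made-up data, and the final fit call is shown as a comment because the model and inputs are assumed to exist elsewhere:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up imbalanced labels (assumption): 90 samples of class 0, 10 of class 1.
y_train = np.array([0] * 90 + [1] * 10)

classes = np.unique(y_train)
# "balanced" uses n_samples / (n_classes * count_of_each_class).
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_train)

# Keras expects a dict mapping class index (int) to weight (float).
class_weight = {int(c): float(w) for c, w in zip(classes, weights)}
print(class_weight)

# Hypothetical usage, assuming a compiled Keras model and x_train exist:
# model.fit(x_train, y_train, class_weight=class_weight)
```

Here the rare class gets a weight of 5.0 and the common class about 0.56, so errors on the under-represented class count more in the loss.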
