Testing on categorical variables using scikit-learn

I'm trying to build and test a model on categorical variables using scikit-learn. I'm interested in using the OneHotEncoder to encode these categorical variables in a sklearn pipeline, after imputation of the data and before a random forest.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Imputer, OneHotEncoder  # older scikit-learn API
from sklearn.ensemble import RandomForestClassifier as RF

estimator = Pipeline([
    ("imputer", Imputer(missing_values='NaN', strategy="median", axis=0)),
    ("dummy", OneHotEncoder(categorical_features=np.where(mask), handle_unknown='ignore')),
    ("forest", RF())])
Training works fine, but the trouble comes when I try to test the generated model on new data. The possible values of the categorical variables for this problem are not bounded, and not all of them show up in the training dataset. So the test data may contain categorical values the model has never seen before, causing crashes in the prediction process due to mismatched dimensions.
As a concrete example, say one of the features I'm training on is fruit_name. The model trains on many examples of various fruits, including bananas, apples, and oranges, and fruit_name is one-hot encoded in the pipeline. However, say my test data contains a fruit_name that the model has never seen before, like kiwi. Then the test data will have an extra column compared to the training data. Alternatively, say the test data doesn't actually contain bananas, apples, or oranges; then it will have fewer columns than the training data. Either way, model testing will crash.
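To make the mismatch concrete, here is a minimal sketch outside the pipeline (using the modern string-capable OneHotEncoder API and made-up fruit data):

import numpy as np
from sklearn.preprocessing import OneHotEncoder

train = np.array([["banana"], ["apple"], ["orange"]])
test = np.array([["kiwi"], ["apple"]])  # "kiwi" never appears in training

# Encoders fit separately produce different widths: 3 columns vs. 2.
print(OneHotEncoder().fit_transform(train).shape)  # (3, 3)
print(OneHotEncoder().fit_transform(test).shape)   # (2, 2)

# Fitting once on the training data and reusing that encoder keeps the
# width fixed; handle_unknown='ignore' maps "kiwi" to an all-zero row.
enc = OneHotEncoder(handle_unknown='ignore').fit(train)
print(enc.transform(test).toarray())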
How do I handle this issue with categorical variables using the sklearn pipeline?

Related

Using a different test/train split for each target

I plan on using a data set which contains 3 target values of interest. Ultimately I will be trying classification methods on a binary target and also plan on using regression methods for two separate continuous targets.
Is it bad practice to do a different train/test split for each target variable?
Otherwise, I am not sure how to split the data in a way that will allow me to predict each target, separately.
If they're effectively 3 different models trained and evaluated separately, then for the purposes of scientifically evaluating each model's performance it doesn't matter if you use different train/test splits for each model, as no information will leak from model to model. But if you plan on comparing the results of the 3 models, or combining all 3 scores into some aggregate metric, then you should use the same train/test split so that all 3 models work from the same training data. Otherwise the performance of each model will likely depend to some extent on the test data for the other models, and your combined score will therefore be partly a function of your test data.
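As a sketch of the shared-split approach (assuming all three targets and the features live in one pandas DataFrame df, with hypothetical column names):

from sklearn.model_selection import train_test_split

targets = ["binary_target", "cont_target_1", "cont_target_2"]  # hypothetical names
X, Y = df.drop(columns=targets), df[targets]

# One split of the rows, reused for all three models, so every model is
# trained and evaluated on exactly the same examples.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

# Y_train["binary_target"] feeds the classifier; the other two columns
# feed the two regressors.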

What steps should I take next to improve my accuracy? Could the data be the problem?

I built various ML models using sklearn for a binary classification problem. The dataset was provided to me by my professor for this comparative study.
My Jupyter notebook and dataset can be found here
As I am getting very low accuracy, I fear that I must be doing something wrong while building the model. So I tested my decision tree on the built-in breast cancer dataset in sklearn, which is similar to my dataset in that both are binary classification problems. There I get a mean accuracy of 95%. So I now think that the problem might be my dataset. Can I get some help on how to pre-process my data, or any other steps I might look into to improve accuracy?
Encode labels
Categorical data are variables that contain label values rather than numeric values. The number of possible values is often limited to a fixed set.
For example, users are typically described by country, gender, age group, etc. We will use LabelEncoder to encode the categorical data. LabelEncoder is part of the scikit-learn library in Python and is used to convert categorical data, or text data, into numbers that our predictive models can better understand.
# Encode the categorical target labels as integers
from sklearn.preprocessing import LabelEncoder
labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(Y)
Feature scaling
Most of the time, your dataset will contain features that vary highly in magnitude, units, and range. Since most machine learning algorithms use the Euclidean distance between data points in their computations, we need to bring all features to the same level of magnitude. This can be achieved by scaling, which means transforming your data so that it fits within a specific scale, like 0–100 or 0–1. We will use the StandardScaler class from the scikit-learn library.
# Feature scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
Choosing the right model
You might also want to choose an appropriate model. You can't just use neural networks for every problem; that's the no-free-lunch theorem. To compare candidate models you could use k-fold cross-validation, AIC, or BIC.
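For example, a minimal cross-validation comparison, assuming X and y are the encoded and scaled features and labels from the steps above:

from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

candidates = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "logistic regression": LogisticRegression(max_iter=1000),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)  # 5-fold cross-validation
    print(name, round(scores.mean(), 3), "+/-", round(scores.std(), 3))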

Is there a way to extract the predicted values from which XGBoost calculates the train/eval errors (stored in evals_result)?

I am looking to gain a better understanding of how my model learns a particular dataset. I wanted to visualize the training and eval phases of learning by plotting the actual training/eval data alongside the model's predictions for the same.
I got the idea from some MATLAB code that allows the user to plot the above-mentioned values. Unfortunately I no longer have access to the MATLAB code and would like to recreate the same in Python.
Using the code below:
import xgboost as xgb

model = xgb.train(params, dtrain, evals=watchlist, evals_result=results, verbose_eval=False)
I can get a results dictionary which saves the training and eval RMSE values, as shown below:
{'eval': {'rmse': [0.557375, 0.504097, 0.449699, 0.404737, 0.364217, 0.327787, 0.295155, 0.266028, 0.235819, 0.212781]}, 'train': {'rmse': [0.405989, 0.370338, 0.337915, 0.308605, 0.281713, 0.257068, 0.234662, 0.214531, 0.195993, 0.179145]}}
While the output shows me the RMSE values, I was wondering whether there is a way to get the predicted values for both the training and eval sets from which these RMSE values are calculated.
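XGBoost does not store the per-round predictions in evals_result, but you can recompute them by re-predicting with the booster truncated at each round. A sketch, assuming dtrain and a deval DMatrix from the watchlist, and xgboost >= 1.4 for the iteration_range argument (older versions used ntree_limit instead):

import numpy as np

n_rounds = len(results["train"]["rmse"])
for i in range(1, n_rounds + 1):
    # predictions using only the first i boosting rounds
    train_pred = model.predict(dtrain, iteration_range=(0, i))
    eval_pred = model.predict(deval, iteration_range=(0, i))
    # this should reproduce results["eval"]["rmse"][i - 1]
    rmse = np.sqrt(np.mean((eval_pred - deval.get_label()) ** 2))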

Scikit classification on categorical variables - feature importance and one-hot encoding - which first?

I have a dataframe comprising 23 categorical variables. I would eventually like to build a predictive model (decision tree/random forest) to predict whether someone will attend an interview or not. This is my target variable. I will use scikit-learn for this task.
Questions:
As these are categorical variables, am I right in saying I need to one-hot encode each of my 23 categorical variables before splitting into train, test, and validation sets?
I have also been told to use feature importance, but I am unsure whether to apply it before or after one-hot encoding. It was my understanding that feature importance would help reduce the number of features I have to one-hot encode; in other words, I would use feature importance before one-hot encoding.
However, the RandomForestClassifier() I attempted to use for feature importance will not work with strings:
Input:
forest = RandomForestClassifier(n_estimators=250, random_state=0)
forest.fit(X, y)
Output:
ValueError: could not convert string to float: 'Single'
What would be the best way to go about this please?
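One common way to sidestep the error, sketched below under the assumption that df holds the 23 categorical columns and y the attendance target (pd.get_dummies is one simple encoding option): encode first, then read the importances off the fitted forest.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

X_encoded = pd.get_dummies(df)  # one 0/1 column per category level
forest = RandomForestClassifier(n_estimators=250, random_state=0)
forest.fit(X_encoded, y)

# importance of each encoded column; columns that came from the same
# original variable can be summed to get a per-variable score
importances = pd.Series(forest.feature_importances_, index=X_encoded.columns)
print(importances.sort_values(ascending=False).head(10))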

Confusion Matrix - Not changing with predictive models (Sklearn)

I have 3 predictive models and I am evaluating their performance with a confusion matrix.
I am getting the same results for the confusion matrix for each of the 3 models.
I expect that the different models would perform differently and produce different confusion matrices. I am new to predictive modelling, so I suspect I am making a rookie mistake. The full script I am using is in a Jupyter notebook on GitHub here
A screenshot of the code for the 3 models is below
Can someone point out what is going wrong?
Cheers
Mike
As mentioned: make predictions on the test data. But keep in mind that your targets are skewed, so use StratifiedKFold or something similar (see the sketch after the list below).
Also, I suspect something is off with your data; when all models show the same result, there may be a bigger mistake underneath.
A few questions/suggestions:
1. Did you scale your data?
2. Did you use one-hot encoding?
3. Don't use single decision trees; use forests/XGBoost instead. It's easy to overfit with a decision tree.
4. Don't use more than 2 hidden layers in a neural network, because that is also easy to overfit; start with 2. And your architecture (30, 30, 30) with 2 target classes seems odd.
5. If you do wish to use more than 2 hidden layers, move to Keras or TensorFlow. You'll find many features there that can help you avoid overfitting.
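A sketch of the stratified evaluation mentioned above, assuming X and y are the features and the skewed target from the notebook:

from sklearn.model_selection import StratifiedKFold, cross_val_score

# each fold preserves the class balance of the skewed target
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=skf)
print(scores.mean())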
That is simply because you are using the same training data to make predictions. Since your models are already trained on the same data that you are making predictions on, they will return the same results (and ultimately the same confusion matrix). You need to split your dataset into training and test sets, then train your classifier on the training set and make predictions on the test set.
You can use train_test_split in sklearn to split your dataset into training and test sets.
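A minimal sketch of that fix, assuming X and y hold the features and the (skewed) binary target, and model is any one of the three classifiers:

from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

# stratify=y keeps the class balance of the skewed target in both halves
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

model.fit(X_train, y_train)             # train on the training set only
y_pred = model.predict(X_test)          # predict on unseen test data
print(confusion_matrix(y_test, y_pred))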
