Using the same LabelEncoder on the test dataset, or a new LabelEncoder? - scikit-learn

I'm a total novice with scikit-learn.
I want to know whether I should reuse the same LabelEncoder instance that was fitted on the training dataset when I convert the same feature's categorical data in the test dataset. In other words, something like this:
from sklearn import preprocessing
# training data label encoding
le_blood_type = preprocessing.LabelEncoder()
df_training[ 'BLOOD_TYPE' ] = le_blood_type.fit_transform( df_training[ 'BLOOD_TYPE' ] ) # labeling from string
....
1. Using the same label encoder
df_test[ 'BLOOD_TYPE' ] = le_blood_type.fit_transform( df_test[ 'BLOOD_TYPE' ] )
2. Using a different label encoder
le_for_test_blood_type = preprocessing.LabelEncoder()
df_test[ 'BLOOD_TYPE' ] = le_for_test_blood_type.fit_transform( df_test[ 'BLOOD_TYPE' ] )
Which one is the right code?
Or does it make no difference which one I choose, because the training dataset's categorical data and the test dataset's categorical data should end up encoded the same way anyway?

The problem is, in fact, the way you use it.
Since LabelEncoder associates each nominal value with a numeric code, you should fit once and then only transform once the object has been fitted. Don't forget that all of your nominal values need to be present in the training phase.
The right way to use it is to fit on your nominal feature once, and from then on use only the transform method.
>>> from sklearn import preprocessing
>>> le = preprocessing.LabelEncoder()
>>> le.fit([1, 2, 2, 6])
LabelEncoder()
>>> le.classes_
array([1, 2, 6])
>>> le.transform([1, 1, 2, 6])
array([0, 0, 1, 2]...)
(from the official docs)

I think RPresle has already given the answer. I just want to relate it a little more directly to the situation in the question:
In general, you only need to fit the LabelEncoder once (on the feature in the training set) and then transform that feature in the test set. But if your test set has feature values that do not appear in the training set, fit the label encoder on the union of the training and test values for that feature, as sketched below.
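A minimal sketch of both cases, assuming pandas DataFrames df_training and df_test with a BLOOD_TYPE column as in the question (the blood-type values here are made up):
import pandas as pd
from sklearn import preprocessing

df_training = pd.DataFrame({'BLOOD_TYPE': ['A', 'B', 'O', 'A']})
df_test = pd.DataFrame({'BLOOD_TYPE': ['O', 'AB', 'A']})  # 'AB' never appears in training

# Case 1: test values are a subset of training values -> fit on the training column only
le_blood_type = preprocessing.LabelEncoder()
le_blood_type.fit(df_training['BLOOD_TYPE'])

# Case 2: the test set may contain unseen values -> fit on the union of both columns
le_blood_type = preprocessing.LabelEncoder()
le_blood_type.fit(pd.concat([df_training['BLOOD_TYPE'], df_test['BLOOD_TYPE']]).unique())

# Either way, afterwards only transform (never re-fit)
df_training['BLOOD_TYPE'] = le_blood_type.transform(df_training['BLOOD_TYPE'])
df_test['BLOOD_TYPE'] = le_blood_type.transform(df_test['BLOOD_TYPE'])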

Related

What is meant by id's and labels in keras data generator?

https://stanford.edu/~shervine/blog/keras-how-to-generate-data-on-the-fly
The link above is the documentation for a custom Keras data generator.
I have a doubt about the "Notation" heading in the above link, which says the following:
Before getting started, let's go through a few organizational tips that are particularly useful when dealing with large datasets.
Let ID be the Python string that identifies a given sample of the dataset. A good way to keep track of samples and their labels is to adopt the following framework:
1. Create a dictionary called partition where you gather:
a) in partition['train'] a list of training IDs
b) in partition['validation'] a list of validation IDs
2. Create a dictionary called labels where for each ID of the dataset, the associated label is given by labels[ID]
For example, let's say that our training set contains id-1, id-2 and id-3 with respective labels 0, 1 and 2, with a validation set containing id-4 with label 1. In that case, the Python variables partition and labels look like
>>> partition
{'train': ['id-1', 'id-2', 'id-3'], 'validation': ['id-4']}
and
>>> labels
{'id-1': 0, 'id-2': 1, 'id-3': 2, 'id-4': 1}
I'm really not able to understand what labels and IDs mean.
For example: say I have a data frame with 1000 columns. Does each row correspond to an ID, i.e. is each ID simply one data point?
OR
Say I have multiple data frames. Does each data frame represent a different ID?
It also seems that labels are not meant to be the class variable.
I would like a clear understanding of IDs and labels, with some examples.
The article you mention describes a good practice for organizing your data between training and validation. The idea is to store the row indexes of your dataframe (called IDs here) and the corresponding target values (called labels here) in separate objects, so that if you transform the inputs you don't lose track of which sample belongs to which label.
Here is a basic example using a train/test split
import pandas as pd
from sklearn.model_selection import train_test_split
df = pd.DataFrame([[0.1, 1, 'label_a'], [0.2, 2, 'label_a'], [0.3, 3, 'label_a'], [0.4, 4, 'label_b']], columns=['feature_a', 'feature_b', 'target'])
# df.index.tolist() results in [0, 1, 2, 3] (4 rows)
partitions = dict()
labels = dict()
X_train, X_test, y_train, y_test = train_test_split(df[['feature_a', 'feature_b']], df['target'], test_size=0.25, random_state=42)
partitions['train'] = X_train.index.tolist()
partitions['validation'] = X_test.index.tolist()
# partitions['train'] results in [3, 0, 2]
# partitions['validation'] results in [1]
labels = df['target'].to_dict()
# labels is {0: 'label_a', 1: 'label_a', 2: 'label_a', 3: 'label_b'}
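To tie this back to the article's notation, here is a tiny, purely illustrative sketch of how these objects are typically consumed (df, partitions and labels as defined above):
# iterate over the training IDs and look up each sample and its label, as in labels[ID]
for ID in partitions['train']:
    sample = df.loc[ID, ['feature_a', 'feature_b']]  # the data point identified by ID
    target = labels[ID]                              # its associated label
    print(ID, sample.tolist(), target)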

Soft-impute on the test set with fancyimpute

The Python package fancyimpute provides several data imputation methods. I have tried to use the SoftImpute approach; however, SoftImpute doesn't offer a transform method to apply to the test dataset. More precisely, sklearn's SimpleImputer (example below) provides fit, transform and fit_transform methods. SoftImpute, on the other hand, provides only fit_transform, which lets me fit on the training data but not transform the test set. I understand that fitting the imputation on the combined training and test sets would leak information from the test set into the training set. Hence we need to fit on the training set and transform the test set. Is there any way to impute the test set using what was fitted on the training set with the SoftImpute approach? I appreciate any thoughts.
# this example from https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html
import numpy as np
from sklearn.impute import SimpleImputer
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
imp_mean.fit([[7, 2, 3], [4, np.nan, 6], [10, 5, 9]])
X_train = [[np.nan, 2, 3], [4, np.nan, 6], [10, np.nan, 9]]
print(imp_mean.transform(X_train))
# SimpleImputer provides a transform method, so we can apply the fitted imputation
# to the testing data, e.g.
# X_test = [...]
# print(imp_mean.transform(X_test))
from fancyimpute import SoftImpute
clf = SoftImpute(verbose=True)
clf.fit_transform(X_train)
# There is no clf.transform to be used on the test set, e.g. clf.transform(X_test)
fancyimpute doesn't support an inductive mode. The important thing here is to fill in the training data without using the test data. You can then impute the test data with the help of the already-imputed training data. Sample code:
import pandas as pd
from fancyimpute import SoftImpute

len_train_data = train_df.shape[0]
imputer = SoftImpute()
# impute train data
X_train_fill_SVD = imputer.fit_transform(train_df)
X_train_fill_SVD = pd.DataFrame(X_train_fill_SVD)
# concat imputed train and (still missing) test data
Concat_data = pd.concat((X_train_fill_SVD, test_df), axis=0)
Concat_data = imputer.fit_transform(Concat_data)
Concat_data = pd.DataFrame(Concat_data)
# fetch the imputed test data
X_test_fill_SVD = Concat_data.iloc[len_train_data:, :]

Should the same imputer coefficients be used for the training and test datasets?

I am learning how to prepare data, build estimators and check using a train/test data split.
My question is how I can prepare the test dataset correctly.
I split my data into a test and a training set. And, as "Hands-On Machine Learning with Scikit-Learn" teaches me, I set up a pipeline for my data preparation:
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
    ('std_scaler', StandardScaler()),
])
After training my estimator, I want to use it on the test data to validate my accuracy. However, if I pass my test feature data through the pipeline I defined, isn't it calculating a new median value and new scaling statistics from only the test dataset, which will differ from the values arrived at on the training dataset?
I presume that, for consistency, I want to re-use the values computed during training, i.e. what the estimator was fitted on. For example, if the test set were just a single row (or if in production I have a single input I want a prediction for), the median values wouldn't even be computable when that single input has a NaN!
What step am I missing?
You must keep in mind what is happening.
Imagine you have the following dataset (input features):
data = [[0, 1], [1, 0], [1, 0], [1, 1]]
scaler = StandardScaler()
scaler.fit(data)
print(scaler.mean_)
[0.75 0.5 ]
print(scaler.transform(data))
[[-1.73205081  1.        ]
 [ 0.57735027 -1.        ]
 [ 0.57735027 -1.        ]
 [ 0.57735027  1.        ]]
But now, if you fit on only part of the data and transform other data (which is what happens in your approach):
data = [[0, 1], [1, 0]]
data2 = [[1,0], [1,1]]
scaler = StandardScaler()
scaler.fit(data)
print(scaler.mean_)
[0.5 0.5]
print(scaler.transform(data2))
[[ 1. -1.]
[ 1. 1.]]
But, as the name "test data" suggests, keep it completely untouched until you run your algorithm: fit your preprocessing on the training data only, then apply the already-fitted pipeline to the test data, as sketched below.
https://stats.stackexchange.com/questions/267012/difference-between-preprocessing-train-and-test-set-before-and-after-splitting
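A minimal sketch of that workflow with the pipeline from the question (the tiny arrays are just placeholders):
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
    ('std_scaler', StandardScaler()),
])

X_train = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, np.nan], [5.0, 6.0]])
X_test = np.array([[np.nan, 4.0]])

# fit the medians and scaling statistics on the training data only ...
X_train_prepared = num_pipeline.fit_transform(X_train)
# ... and reuse them, unchanged, on the test data (even a single row works)
X_test_prepared = num_pipeline.transform(X_test)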

Is it possible to use labelled data in SKLearn?

Currently my code looks like:
clf = RandomForestClassifier(n_estimators=10, criterion='entropy')
clf = clf.fit(X, Y)
However X is an array like:
X = [[0, 1], [1, 1]]
I would prefer to use X like:
X = [{'avg': 0, 'stddev': 1}, {'avg': 1, 'stddev': 1}]
Simply because plotting a tree (as described here: http://scikit-learn.org/stable/modules/tree.html#classification ) makes much more sense when you read X[0]['avg'] rather than X[0][0]. Is it possible, using a dictionary or pandas?
You can use the DictVectorizer class to convert such a list of dicts to sparse matrices or dense numpy arrays.
scikit-learn will never use dict objects as the primary data structure to store records internally, as this is not memory efficient at all compared to numpy arrays or scipy sparse matrices.
Here is a great example by 'larsmans' on how to build a feature dict and use DictVectorizer before fitting a model on the data. Note that the DictVectorizer class uses a scipy.sparse matrix by default (instead of a numpy.ndarray) to keep the resulting data structure small enough to fit in memory. As not all sklearn models support sparse matrices, you might want to use the sparse=False option in the constructor to obtain a dense array:
dv = DictVectorizer(sparse=False)
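A minimal sketch of that workflow on the X from the question (Y is a made-up target):
from sklearn.feature_extraction import DictVectorizer
from sklearn.ensemble import RandomForestClassifier

X = [{'avg': 0, 'stddev': 1}, {'avg': 1, 'stddev': 1}]
Y = [0, 1]

dv = DictVectorizer(sparse=False)
X_dense = dv.fit_transform(X)  # 2-D numpy array, columns ordered by feature name
print(dv.feature_names_)       # ['avg', 'stddev']

clf = RandomForestClassifier(n_estimators=10, criterion='entropy')
clf = clf.fit(X_dense, Y)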
Alternatively, you can specify feature names when using export_graphviz. This will generate a tree with more meaningful labels at the test nodes; a sketch follows below.
See the feature_names parameter at http://scikit-learn.org/dev/modules/generated/sklearn.tree.export_graphviz.html#sklearn.tree.export_graphviz
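A minimal sketch of that option, assuming a single fitted decision tree and the two feature names from the question:
from sklearn import tree

X = [[0, 1], [1, 1]]
Y = [0, 1]
clf = tree.DecisionTreeClassifier().fit(X, Y)

# write a Graphviz .dot file whose test nodes read "avg <= ..." instead of "X[0] <= ..."
tree.export_graphviz(clf, out_file='tree.dot', feature_names=['avg', 'stddev'])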

How to find key trees/features from a trained random forest?

I am using Scikit-Learn Random Forest Classifier and trying to extract the meaningful trees/features in order to better understand the prediction results.
I found this method, which seems relevant, in the documentation (http://scikit-learn.org/dev/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.get_params), but couldn't find an example of how to use it.
I am also hoping to visualize those trees if possible, any relevant code would be great.
Thank you!
I think you're looking for Forest.feature_importances_. This allows you to see what the relative importance of each input feature is to your final model. Here's a simple example.
import random
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Let's set up a training dataset. We'll make 100 entries, each with 19 features, and
# each row classified as either 0 or 1. We'll control the first 3 features: rows
# classified as "1" get set values for those features, so that we know these are the
# "important" ones. If we do it right, the model should point out these three as important.
# The rest of the features will just be noise.
train_data = []  # must be all floats
for x in range(100):
    line = []
    if random.random() > 0.5:
        line.append(1.0)
        # Add the 3 features that we know indicate a row classified as "1".
        line.append(.77)
        line.append(.33)
        line.append(.55)
        for x in range(16):  # fill in the rest with noise
            line.append(random.random())
    else:
        # This is a "0" row, so fill it entirely with noise.
        line.append(0.0)
        for x in range(19):
            line.append(random.random())
    train_data.append(line)
train_data = np.array(train_data)

# Create the random forest object which will include all the parameters for the fit.
# (Feature importances are computed automatically; the old compute_importances
# argument no longer exists in current scikit-learn.)
Forest = RandomForestClassifier(n_estimators=100)

# Fit the training data to the training output and create the decision trees.
# The first column in our data is the classification, the rest of the columns are the features.
Forest = Forest.fit(train_data[0::, 1::], train_data[0::, 0])

# Now you can see the importance of each feature in Forest.feature_importances_.
# These values all add up to one. Let's call the "important" ones those above average.
important_features = []
for x, i in enumerate(Forest.feature_importances_):
    if i > np.average(Forest.feature_importances_):
        important_features.append(str(x))
print('Most important features:', ', '.join(important_features))
# We see that the model correctly detected that the first three features are the most
# important, just as we expected!
To get the relative feature importances, read the relevant section of the documentation along with the code of the linked examples in that same section.
The trees themselves are stored in the estimators_ attribute of the random forest instance (only after the call to the fit method). Now to extract a "key tree" one would first require you to define what it is and what you are expecting to do with it.
You could rank the individual trees by computing their score on a held-out test set (a sketch follows below), but I don't know what you would expect to get out of that.
Do you want to prune the forest to make it faster to predict by reducing the number of trees without decreasing the aggregate forest accuracy?
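A minimal, purely illustrative sketch of that ranking, assuming a fitted RandomForestClassifier named Forest and held-out arrays X_holdout and y_holdout:
# each fitted tree lives in Forest.estimators_; score them on the held-out data
# and sort from best to worst
tree_scores = [(i, est.score(X_holdout, y_holdout))
               for i, est in enumerate(Forest.estimators_)]
tree_scores.sort(key=lambda pair: pair[1], reverse=True)
print(tree_scores[:5])  # the five "best" trees and their accuracies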
Here is how I visualize the forest's feature importances:
First make the model after you have done all of the preprocessing, splitting, etc:
# max number of trees = 100
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 100, criterion = 'entropy', random_state = 0)
classifier.fit(X_train, y_train)
Make predictions:
# Predicting the Test set results
y_pred = classifier.predict(X_test)
Then make the plot of importances. The variable dataset is the name of the original dataframe.
import numpy as np
import matplotlib.pyplot as plt

# get importances from RF
importances = classifier.feature_importances_
# argsort gives ascending order; with barh the largest importance ends up at the top
indices = np.argsort(importances)
# get the feature names from the original data set
features = dataset.columns[0:26]
# plot them with a horizontal bar chart
plt.figure(1)
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), features[indices])
plt.xlabel('Relative Importance')
plt.show()
This yields a horizontal bar chart of the features, sorted by relative importance.
