Multi Class Classification using XGBClassifier - scikit-learn

I am using XGBClassifier for multiclass classification (5 classes: [1,2,3,4,5]). I have set the objective parameter to 'multi:softmax', but when I predict with my model I still get continuous values instead of integers.
I also tried specifying the num_class parameter, but it still predicts continuous values.
model = XGBClassifier(learning_rate = 0.1,n_estimators = 200, objective='multi:softmax')
model.fit(x1, y1, eval_set=[(x1,y1),(x2, y2)], eval_metric='mlogloss')
Expected output = [1,2,3,3,2,3,4,4,5,5,1.... etc] #integers values
Actual Output = [2.334, 1.455, 2.122, 1.76 .... etc] #continuous values

Related

Get unnormalized predict_proba when using "ovr" in LogisticRegression

When using LogisticRegression.predict_proba, the docs say that, when multi_class='ovr', it returns the normalized probability for each class.
Is there a way, without having to calculate it using the model.coef_
like
pred_proba_manually = [1/(1+np.exp(-(intercept + np.dot(tf_idf_val, coef)))) for
                       coef,intercept in zip(logreg.coef_,logreg.intercept_)]
to get the unnormalized probabilities for each class?
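To make the relationship in the question concrete, here is a minimal NumPy-only sketch of what the one-vs-rest description in the scikit-learn docs amounts to: an independent sigmoid per class (the unnormalized values), then division by the row sum (the normalized values). The function name ovr_probabilities and the small arrays are hypothetical stand-ins for the feature matrix, logreg.coef_ and logreg.intercept_; this is not scikit-learn's own code.

```python
import numpy as np

def ovr_probabilities(X, coef, intercept):
    """Return (unnormalized, normalized) one-vs-rest probabilities.

    X: (n_samples, n_features), coef: (n_classes, n_features),
    intercept: (n_classes,). Hypothetical stand-ins for the feature
    matrix, logreg.coef_ and logreg.intercept_.
    """
    scores = X @ coef.T + intercept                # one linear score per class
    unnormalized = 1.0 / (1.0 + np.exp(-scores))   # independent sigmoid per class
    normalized = unnormalized / unnormalized.sum(axis=1, keepdims=True)
    return unnormalized, normalized

# Made-up numbers, for illustration only.
X = np.array([[1.0, 0.0], [0.5, 2.0]])
coef = np.array([[1.0, -1.0], [0.0, 0.5], [-1.0, 1.0]])
intercept = np.array([0.0, 0.1, -0.2])
unnormalized, normalized = ovr_probabilities(X, coef, intercept)
```

With a fitted model, the linear scores should correspond to what decision_function returns, so applying the sigmoid to that output is one way to avoid touching coef_ directly. Treat that as something to verify against your scikit-learn version.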

MultiOutput Classification with TensorFlow Extended (TFX)

I'm quite new to TFX (TensorFlow Extended), and have been going through the sample tutorial on the TensorFlow portal to understand a bit more to apply it to my dataset.
In my scenario, instead of predicting a single label, the problem at hand requires me to predict 2 outputs (category 1, category 2).
I've done this using the pure TensorFlow Keras Functional API and that works fine, but I am now looking to see whether it can be fitted into the TFX pipeline.
The error occurs at the Trainer stage of the pipeline, inside _input_fn, and I suspect it is because I am not correctly splitting the given data into a (features, labels) tensor pair.
Scenario:
Each row of the input data comes in the form of
[Col1, Col2, Col3, ClassificationA, ClassificationB]
ClassificationA and ClassificationB are the categorical labels which I'm trying to predict using the Keras Functional Model.
The output layer of the Keras functional model looks like the code below, where 2 outputs are connected to a single dense layer (Note: the _xf appended to the names just indicates that I've encoded the classes to int representations)
output_1 = tf.keras.layers.Dense(
TargetA_Class, activation='sigmoid',
name = 'ClassificationA_xf')(dense)
output_2 = tf.keras.layers.Dense(
TargetB_Class, activation='sigmoid',
name = 'ClassificationB_xf')(dense)
model = tf.keras.Model(inputs = inputs,
outputs = [output_1, output_2])
In the trainer module file, I've imported the required packages at the start of the module file:
import tensorflow_transform as tft
from tfx.components.tuner.component import TunerFnResult
import tensorflow as tf
from typing import List, Text
from tfx.components.trainer.executor import TrainerFnArgs
from tfx.components.trainer.fn_args_utils import DataAccessor, FnArgs
from tfx_bsl.tfxio import dataset_options
The current input_fn in the trainer module file looks like the below (by following the tutorial)
def _input_fn(file_pattern: List[Text],
              data_accessor: DataAccessor,
              tf_transform_output: tft.TFTransformOutput,
              batch_size: int = 200) -> tf.data.Dataset:
    """Helper function that generates features and label dataset for tuning/training.
    Args:
        file_pattern: List of paths or patterns of input tfrecord files.
        data_accessor: DataAccessor for converting input to RecordBatch.
        tf_transform_output: A TFTransformOutput.
        batch_size: representing the number of consecutive elements of returned
            dataset to combine in a single batch
    Returns:
        A dataset that contains (features, indices) tuple where features is a
        dictionary of Tensors, and indices is a single Tensor of label indices.
    """
    return data_accessor.tf_dataset_factory(
        file_pattern,
        dataset_options.TensorFlowDatasetOptions(
            batch_size=batch_size,
            #label_key=[_transformed_name(x) for x in _CATEGORICAL_LABEL_KEYS]),
            label_key=_transformed_name(_CATEGORICAL_LABEL_KEYS[0]), _transformed_name(_CATEGORICAL_LABEL_KEYS[1])),
        tf_transform_output.transformed_metadata.schema)
When I run the trainer component, the error that comes up is:
label_key=_transformed_name(_CATEGORICAL_LABEL_KEYS[0]), _transformed_name(_CATEGORICAL_LABEL_KEYS[1])),
^ SyntaxError: positional argument follows keyword argument
I've also tried label_key=[_transformed_name(x) for x in _CATEGORICAL_LABEL_KEYS], which also gives an error.
However, if I just pass in a single label key, label_key=_transformed_name(_CATEGORICAL_LABEL_KEYS[0]), then it works fine.
FYI: _CATEGORICAL_LABEL_KEYS is just a list containing the names of the 2 outputs I'm trying to predict (ClassificationA, ClassificationB).
_transformed_name is just a function that returns an updated name/key for the transformed data:
def _transformed_name(key):
    return key + '_xf'
Question:
From what I can see, the label_key argument of dataset_options.TensorFlowDatasetOptions only accepts a single string (the name of one label), which means it may not be able to output a dataset with multiple labels.
Is there a way I can modify _input_fn so that the dataset it returns carries both output labels? The returned tensors would look something like:
Feature_Tensor: {Col1_xf: Col1_transformedfeature_values,
                 Col2_xf: Col2_transformedfeature_values,
                 Col3_xf: Col3_transformedfeature_values}
Label_Tensor: {ClassificationA_xf: ClassA_encodedlabels,
               ClassificationB_xf: ClassB_encodedlabels}
Would appreciate advice from the wider community of tfx!
Since the label key is optional, instead of specifying it in TensorFlowDatasetOptions you could call dataset.map afterwards and return both labels, taking them out of the feature dictionary of your dataset.
I haven't tested it, but something like:
def _data_augmentation(feature_dict):
    # Split the label columns out of the transformed feature dictionary.
    features = dict(feature_dict)
    label_keys = [_transformed_name(x) for x in _CATEGORICAL_LABEL_KEYS]
    labels = {key: features.pop(key) for key in label_keys}
    return features, labels
def _input_fn(file_pattern: List[Text],
              data_accessor: DataAccessor,
              tf_transform_output: tft.TFTransformOutput,
              batch_size: int = 200) -> tf.data.Dataset:
    """Helper function that generates features and label dataset for tuning/training.
    Args:
        file_pattern: List of paths or patterns of input tfrecord files.
        data_accessor: DataAccessor for converting input to RecordBatch.
        tf_transform_output: A TFTransformOutput.
        batch_size: representing the number of consecutive elements of returned
            dataset to combine in a single batch
    Returns:
        A dataset that contains (features, labels) tuple where features is a
        dictionary of Tensors, and labels is a dictionary of label Tensors.
    """
    # No label_key here: the labels are split out by dataset.map below.
    dataset = data_accessor.tf_dataset_factory(
        file_pattern,
        dataset_options.TensorFlowDatasetOptions(
            batch_size=batch_size),
        tf_transform_output.transformed_metadata.schema)
    dataset = dataset.map(_data_augmentation)
    return dataset
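The splitting step suggested in this answer can be tried in isolation on a plain dictionary before wiring it into the pipeline. The sketch below is a minimal, untested-in-TFX illustration. The name split_features_and_labels, the LABEL_KEYS list and the toy batch are hypothetical, and it assumes the dataset yields one dictionary per batch that contains both feature and label columns. A function of this shape can be passed to tf.data.Dataset.map, since it only does dictionary operations.

```python
from typing import Any, Dict, Tuple

# Assumed label column names; they mirror the Keras output layer names
# from the question so that dictionary labels line up with the named outputs.
LABEL_KEYS = ["ClassificationA_xf", "ClassificationB_xf"]

def split_features_and_labels(
        batch: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Split one batch dictionary into (features, labels) dictionaries."""
    features = dict(batch)  # shallow copy so the input is not mutated
    labels = {key: features.pop(key) for key in LABEL_KEYS}
    return features, labels

# Toy batch with plain lists standing in for tensors.
batch = {
    "Col1_xf": [0.1, 0.2],
    "Col2_xf": [1.0, 3.0],
    "Col3_xf": [5, 7],
    "ClassificationA_xf": [0, 2],
    "ClassificationB_xf": [1, 1],
}
features, labels = split_features_and_labels(batch)
```

Keying the labels by the output layer names is a design choice: Keras can match a dictionary of labels to named outputs, which avoids relying on the order of the outputs list.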

Get feature importance PySpark Naive Bayes classifier

I have a Naive Bayes classifier that I wrote in Python using a Pandas data frame and now I need it in PySpark. My problem here is that I need the feature importance of each column. When looking through the PySpark ML documentation I couldn't find any info on it.
Does anyone know if I can get the feature importance with the Naive Bayes Spark MLlib?
The code using Python is the following. The feature importance is retrieved with .coef_
df = df.fillna(0).toPandas()
X_df = df.drop(['NOT_OPEN', 'unique_id'], axis = 1)
X = X_df.values
Y = df['NOT_OPEN'].values.reshape(-1,1)
mnb = BernoulliNB(fit_prior=True)
y_pred = mnb.fit(X, Y).predict(X)
estimator = mnb.fit(X, Y)
# coef_: For a binary classification problems this is the log of the estimated probability of a feature given the positive class. It means that higher values mean more important features for the positive class.
feature_names = X_df.columns
coefs_with_fns = sorted(zip(estimator.coef_[0], feature_names))
If you're interested in an equivalent of coef_, the property you're looking for is NaiveBayesModel.theta, which the PySpark documentation describes as:
"log of class conditional probabilities. New in version 2.0.0."
i.e.
model = ... # type: NaiveBayesModel
model.theta.toArray() # type: numpy.ndarray
The resulting array is of size (number-of-classes, number-of-features), and rows correspond to consecutive labels.
It is probably better to use the difference
log(P(feature_X|positive)) - log(P(feature_X|negative))
as a feature importance, because we are interested in the discriminative power of each feature_X (even though NB is, of course, a generative model).
Extreme example: some feature_X1 has the same value across all + and - samples, so it has no discriminative power.
The probability of this feature value is high for both + and - samples, but the difference of the log probabilities is 0.
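The difference of log probabilities described above can be sketched as follows. This is a minimal illustration on a made-up array, not PySpark code. It assumes theta has the (number-of-classes, number-of-features) layout of model.theta.toArray(), with row 0 the negative label and row 1 the positive label. The function name log_prob_difference and the feature names are hypothetical.

```python
import numpy as np

def log_prob_difference(theta, positive_row=1, negative_row=0):
    """Per-feature log P(feature | positive) - log P(feature | negative).

    theta is assumed to be laid out like model.theta.toArray():
    one row per class, one column per feature, values in log space.
    """
    theta = np.asarray(theta, dtype=float)
    return theta[positive_row] - theta[negative_row]

# Made-up conditional probabilities, for illustration only.
theta = np.log(np.array([
    [0.9, 0.2, 0.5],   # P(feature | negative)
    [0.9, 0.8, 0.1],   # P(feature | positive)
]))
feature_names = ["f_constant", "f_pos", "f_neg"]

diff = log_prob_difference(theta)
# Rank by absolute difference: large magnitude means more discriminative.
ranking = sorted(zip(feature_names, diff), key=lambda p: abs(p[1]), reverse=True)
```

Here f_constant is equally likely under both classes, so its difference is 0 and it ranks last, which matches the extreme example in the answer.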

How to use categorical data neural network in tensorflow without estimator?

I am trying to build a neural network without using estimators. I have defined layers as,
x_categorical = tf.placeholder(tf.string)
x_numeric = tf.placeholder(tf.float32)
l1 = tf.add(tf.matmul(x_numeric,weights), biases)
l2 = tf.add(tf.matmul(x_categorical,weights), biases)
tf.matmul works well for numeric features, but I also have some categorical features that I am unable to use this way.
I tried tf.string_to_hash_bucket_fast, but it converts the string to int64, which is not supported by tf.matmul. I also tried tf.decode_raw, which did not work either. Please help me with this; I want to use the categorical features as well.
To handle categorical values in a Neural Network you have to represent them in OneHot representation. If they are string (as it seems to be your case) you first have to convert them to "Integer representation". Step by step:
Using from sklearn.preprocessing import LabelEncoder,OneHotEncoder
Define your categorical string values:
categorical_values = np.array([['Foo','bar','values'],['more','foo','bar'],['many','foo','bar']])
Then encode them as integers:
categorical_values[:,0] = LabelEncoder().fit_transform(categorical_values[:,0])
categorical_values[:,1] = LabelEncoder().fit_transform(categorical_values[:,1])
categorical_values[:,2] = LabelEncoder().fit_transform(categorical_values[:,2])
And use OneHotEncoder to obtain the OneHot representation:
oneHot_values = OneHotEncoder().fit_transform(categorical_values).toarray()
Define your graph:
x_categorical = tf.placeholder(shape=[NUM_OBSERVATIONS,NUM_FEATURES],dtype=tf.float32)
weights = tf.Variable(tf.truncated_normal([NUM_FEATURES,NUM_CLASSES]),dtype=tf.float32)
bias = tf.Variable(tf.zeros([NUM_CLASSES]),dtype=tf.float32)  # one zero-initialized bias per class
l2 = tf.add(tf.matmul(x_categorical,weights),bias)
And execute it obtaining the results:
with tf.Session() as sess:
    tf.global_variables_initializer().run()
    _l2 = sess.run(l2,feed_dict={x_categorical : oneHot_values})
Edit: As requested, no-sklearn version.
Using just numpy.unique() and tensorflow.one_hot()
categorical_values = np.array(['Foo','bar','values']) #For one observation
lookup, labeledValues = np.unique(categorical_values, return_inverse=True)
oneHotValues = tf.one_hot(labeledValues,depth=len(lookup))  # depth = number of distinct categories
Full example on the JN linked below
Here you have a Jupyter Notebook with the code on my Github
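For reference, the no-sklearn route from the edit above can also be done with NumPy alone, which gives a float matrix that can be fed to a float32 placeholder. This is a minimal sketch for a single categorical column. The function name one_hot_encode and the sample values are hypothetical.

```python
import numpy as np

def one_hot_encode(values):
    """One-hot encode a 1-D array of category strings.

    Returns the sorted distinct categories and a float32 matrix of shape
    (n_observations, n_categories) with a single 1.0 per row.
    """
    categories, codes = np.unique(values, return_inverse=True)
    # Row i of the identity matrix is the one-hot vector for category i.
    one_hot = np.eye(len(categories), dtype=np.float32)[codes]
    return categories, one_hot

column = np.array(["foo", "bar", "foo", "baz"])
categories, one_hot = one_hot_encode(column)
```

The width of the resulting matrix is the number of distinct categories, so the first dimension of the weights variable has to match that width rather than the number of original columns.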

scikit-learn: Is there a way to provide an object as an input to predict function of a classifier?

I am planning to use an SGDClassifier in production. The idea is to train the classifier on some training data, use cPickle to dump it to a .pkl file, and reuse it later in a script. However, there are certain high-cardinality categorical fields which are translated to a one-hot matrix representation, creating around 5000 features. The input that I get for predict will have only one of these features set, and the rest will all be zeroes. It will of course also include the other numerical features. From the docs, it appears that the predict function expects an array of arrays as input. Is there any way I can transform my input to the format expected by the predict function without having to store the fields every time I train the model?
Update
So, let us say my input contains 3 fields:
{
rate: 10, // numeric
flagged: 0, //binary
host: 'somehost.com' // keeping this categorical
}
host can have around 5000 different values. Now I loaded the file to a pandas dataframe, used the get_dummies function to transform the host field to around 5000 new fields which are binary fields.
Then I trained my model and stored it using cPickle.
Now, when I need to use the predict function, for the input, I only have 3 fields (shown above). However, as per my understanding the predict endpoint will expect an array of vectors and each vector is supposed to have those 5000 fields.
For the entry that I need to predict, I know only one field for that entry which will be the value of host itself.
For example, if my input is
{
rate: 5,
flagged: 1
host: 'new_host.com'
}
I know that the fields expected by the predict should be:
{
rate: 5,
flagged: 1
new_host: 1
}
But if I translate it to vector format, I don't know which index to place the new_host field. Also, I don't know in advance what other hosts are (unless I store it somewhere during the training phase)
I hope I am making some sense. Let me know if I am doing it the wrong way.
"I don't know which index to place the new_host field"
A good approach that has worked for me is to build a pipeline which you then use for training and prediction. This way you do not have to concern yourself with the column index of whatever output is produced by your transformation:
# in training
pipl = Pipeline(steps=[('binarizer', LabelBinarizer()),
                       ('clf', SGDClassifier())])
model = pipl.fit(X, Y)
pickle.dump(model, mf)  # mf: a file opened in binary write mode
# in production
model = pickle.load(mf)  # mf: the same file opened in binary read mode
y = model.predict(X)
As X, Y inputs you need to pass an array-like object. Make sure the input is the same structure for both training and test, e.g.
X = [[data.get('rate'), data.get('flagged'), data.get('host')]]
Y = [[y-cols]] # your example doesn't specify what is Y in your data
More flexible: Pandas DataFrame + Pipeline
What also works nicely is to use a Pandas DataFrame in combination with sklearn-pandas as it allows you to use different transformations on different column names. E.g.
df = pd.DataFrame.from_dict(data)
mapper = DataFrameMapper([
('host', sklearn.preprocessing.LabelBinarizer()),
('rate', sklearn.preprocessing.StandardScaler())
])
pipl = Pipeline(steps=[('mapper', mapper),
('clf', SGDClassifier())])
X = df[x-cols]
y = df[y-col(s)]
pipl.fit(X, y)
Note that x-cols and y-col(s) are the list of the feature and target columns respectively.
You should use a scikit-learn transformer instead of get_dummies. In this case, LabelBinarizer makes sense. Seeing as LabelBinarizer doesn't work in a pipeline, this is one way to do what you want:
binarizer = LabelBinarizer()
# fitting LabelBinarizer means it remembers all the columns it's seen
one_hot_data = binarizer.fit_transform(X_train[:, categorical_col])
# replace string column with one-hot representation
X_train = np.concatenate([np.delete(X_train, categorical_col, axis=1),
                          one_hot_data], axis=1)
clf = SGDClassifier()
clf.fit(X_train, y)
# store the classifier together with the fitted binarizer (f: file opened in binary write mode)
pickle.dump({'clf': clf, 'binarizer': binarizer}, f)
then at prediction time:
estimators = pickle.load(f)
clf = estimators['clf']
binarizer = estimators['binarizer']
one_hot_data = binarizer.transform(X_test[:, categorical_col])
X_test = np.concatenate([np.delete(X_test, categorical_col, axis=1),
one_hot_data], axis=1)
clf.predict(X_test)
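What "the binarizer remembers the columns it has seen" means can be shown without scikit-learn. The sketch below is a plain-Python illustration of the idea, not the library's implementation. The names fit_host_index and to_vector are hypothetical, and mapping a host that was not seen in training to an all-zero block is an assumption about the desired behavior.

```python
from typing import Dict, List

def fit_host_index(hosts: List[str]) -> Dict[str, int]:
    """Assign each distinct training host a fixed column position."""
    return {host: i for i, host in enumerate(sorted(set(hosts)))}

def to_vector(record: dict, host_index: Dict[str, int]) -> List[float]:
    """Build [rate, flagged, one-hot host...] in the training column order."""
    one_hot = [0.0] * len(host_index)
    position = host_index.get(record["host"])
    if position is not None:  # hosts unseen in training stay all zeros
        one_hot[position] = 1.0
    return [float(record["rate"]), float(record["flagged"])] + one_hot

# "Training": learn the column order once and store it with the model.
host_index = fit_host_index(["a.com", "somehost.com", "a.com", "b.org"])

# "Prediction": a raw 3-field record becomes a full-width vector.
vector = to_vector({"rate": 5, "flagged": 1, "host": "somehost.com"}, host_index)
unseen = to_vector({"rate": 5, "flagged": 1, "host": "new_host.com"}, host_index)
```

In practice host_index plays the role of the fitted binarizer: it is pickled next to the classifier, so the prediction script never needs the training data to know where each host column goes.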

Resources