SparkML LogisticRegression vs Sklearn's: Different coefficients and intercepts

I may be missing some initialization parameters or something like that.
I created a LogisticRegression in pyspark and another one using scikit-learn, then trained both on the same data. When I compared the resulting models' parameters, the coefficients were very different. The intercept in pyspark turned out to be a single number, and it is also very different from sklearn's.
The predictions are the same in terms of labels, but the probability values are different.
pyspark==2.3.2
scikit-learn==0.20.2
pyspark LR creation:
from pyspark.ml.classification import LogisticRegression

data = self.spark.read.format("libsvm").load(input_path)
lr = LogisticRegression(maxIter=10, tol=0.0001)
model = lr.fit(data)
predicted = model.transform(data)
sklearn creation:
from sklearn import linear_model

sk_model = linear_model.LogisticRegression()
sk_model.fit(np_x, np_y)
sk_expected = [sk_model.predict(np_x), sk_model.predict_proba(np_x)]
Data conversion from spark to numpy:
import numpy
import pandas
from pyspark.sql.types import ArrayType, FloatType

self.spark.udf.register("sparseToArray", lambda x: x.toArray().tolist(),
                        ArrayType(elementType=FloatType(), containsNull=False))
np_y = data.select("label").toPandas().label.values.astype(numpy.float32)
np_x = data.selectExpr("sparseToArray(features) as features").toPandas().features.apply(pandas.Series).values.astype(numpy.float32)
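One hedged aside (not part of the original post): the two libraries do not use the same defaults, which alone can explain different coefficients. sklearn's LogisticRegression applies L2 regularization with C=1.0 by default, while Spark ML's LogisticRegression defaults to regParam=0.0 (no regularization) and standardizes features internally. A rough sketch of aligning the obvious differences before comparing:
# Hedged sketch: bring the two configurations closer together before comparing coefficients.
lr = LogisticRegression(maxIter=10, tol=0.0001, regParam=0.0,
                        standardization=False)  # Spark: no regularization, no internal feature scaling
sk_model = linear_model.LogisticRegression(C=1e12, tol=0.0001)  # very large C ~ effectively unregularized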

Related

Xgboost on Spark Validation Indicator Column and Evaluation Metric

I am using the xgboost PySpark API. This API is experimental but it supports most of the features of the xgboost API.
As per the documentation below, the eval_set parameter is not supported; instead, the validationIndicatorCol parameter should be used.
https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.spark
https://databricks.github.io/spark-deep-learning/#module-sparkdl.xgboost
xgb = XgboostClassifier(featuresCol="features",
                        labelCol="label",
                        num_workers=40,
                        random_state=1,
                        missing=None,
                        objective='binary:logistic',
                        validationIndicatorCol='isVal',
                        eval_metric='aucpr',
                        n_estimators=best_n_estimators,
                        max_depth=best_max_depth,
                        learning_rate=best_learning_rate)
pipeline = Pipeline(stages=[vectorAssembler, xgb])
pipelineModel = pipeline.fit(sampled_df)
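For reference, the validationIndicatorCol above simply names a boolean column that marks which rows act as the evaluation set. A minimal, hedged sketch of creating it before the fit (the 20% split and seed are assumptions, not from the original post):
from pyspark.sql import functions as F
# flag roughly 20% of rows as the validation set; the column name matches validationIndicatorCol
sampled_df = sampled_df.withColumn("isVal", F.rand(seed=1) < 0.2)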
It seems to be running without any errors, which is great.
How do you print and inspect the evaluation results? Traditional xgboost has an evals_result() method, but pipelineModel.stages[-1].evals_result() doesn't seem to work in the PySpark API. This method should normally work, since the PySpark API documentation doesn't say otherwise. Any idea on how to make it work?

probability difference between categorical target and one-hot encoding target using OneVsRestClassifier

A bit confused about the probability difference between a categorical target and a one-hot encoded target from sklearn's OneVsRestClassifier. Using the iris data with a simple logistic regression as an example: when I use the original iris classes [0, 1, 2], the probabilities computed by OneVsRestClassifier() always add up to 1 for each observation. However, if I convert the target to dummies, this is not the case. I understand that OneVsRestClassifier() compares one class vs. the rest (class 0 vs. non class 0, class 1 vs. non class 1, etc.), so it makes more sense that the sum of these probabilities has no relation to 1. Then why do I see the difference, and how so?
import numpy as np
import pandas as pd
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn import datasets
np.set_printoptions(suppress=True)
iris = datasets.load_iris()
rng = np.random.RandomState(0)
perm = rng.permutation(iris.target.size)
X = iris.data[perm]
y = iris.target[perm]
# categorical target with no conversion
X_train, y_train1 = X[:80], y[:80]
X_test, y_test1 = X[80:], y[80:]
m3 = LogisticRegression(random_state=0)
clf1 = OneVsRestClassifier(m3).fit(X_train, y_train1)
y_pred1 = clf1.predict(X_test)
print(np.sum(y_pred1 == y_test1))
y_prob1 = clf1.predict_proba(X_test)
y_prob1[:5]
#output
array([[0.00014508, 0.17238549, 0.82746943],
[0.03850173, 0.79646817, 0.1650301 ],
[0.73981106, 0.26018067, 0.00000827],
[0.00016332, 0.32231163, 0.67752505],
[0.00029197, 0.2495404 , 0.75016763]])
# one hot encoding for categorical target
y2 = pd.get_dummies(y)
y_train2 = y2[:80]
y_test2 = y2[80:]
clf2 = OneVsRestClassifier(m3).fit(X_train, y_train2)
y_pred2 = clf2.predict(X_test)
y_prob2 = clf2.predict_proba(X_test)
y_prob2[:5]
#output
array([[0.00017194, 0.20430011, 0.98066319],
[0.02152246, 0.44522562, 0.09225181],
[0.96277892, 0.3385952 , 0.00001076],
[0.00023024, 0.45436925, 0.95512082],
[0.00036849, 0.31493725, 0.94676348]])
When you encode the targets, sklearn interprets your problem as a multilabel one instead of just multiclass; that is, it is possible for a point to have more than one true label. In that case, it is perfectly acceptable for the total sum of probabilities to be greater (or less) than 1. That's generally true for sklearn, but OneVsRestClassifier calls it out specifically in the docstring:
OneVsRestClassifier can also be used for multilabel classification. To use this feature, provide an indicator matrix for the target y when calling .fit.
As for the first approach, there are indeed three independent models, but the predicted probabilities are normalized; see the source code. That's the only difference:
(y_prob2 / y_prob2.sum(axis=1)[:, None] == y_prob1).all()
# output
True
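In other words, the normalization can be reproduced by hand from the three underlying binary estimators (a sketch illustrating the source-code behaviour, not additional API):
# each fitted binary estimator lives in clf1.estimators_; take its positive-class probability
per_class = np.column_stack([est.predict_proba(X_test)[:, 1] for est in clf1.estimators_])
manual = per_class / per_class.sum(axis=1, keepdims=True)
np.allclose(manual, y_prob1)  # expected True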
It's probably worth pointing out that LogisticRegression also natively supports multiclass. In that case, the weights for each class are independent, so it's similar to three separate models, but the resulting probabilities come from applying a softmax, and the loss function minimizes the loss over all classes simultaneously, so the resulting coefficients (and hence predictions) can differ from those obtained from OneVsRestClassifier:
m3.fit(X_train, y_train1)
y_prob0 = m3.predict_proba(X_test)
y_prob0[:5]
# output:
array([[0.00000494, 0.01381671, 0.98617835],
[0.02569699, 0.88835451, 0.0859485 ],
[0.95239985, 0.04759984, 0.00000031],
[0.00001338, 0.04195642, 0.9580302 ],
[0.00002815, 0.04230022, 0.95767163]])
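To make the softmax point concrete, the native multinomial probabilities can be recovered from the decision scores (a hedged check, assuming a recent sklearn where the default multi_class resolves to multinomial):
# predict_proba of the multinomial fit is the softmax of decision_function
scores = m3.decision_function(X_test)
manual_softmax = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
np.allclose(manual_softmax, y_prob0)  # expected True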

How to get feature importances/feature ranking from summary plot in SHAP without crashing?

I am attempting to get shap values out of an array which was created by
explainer = shap.Explainer(xg_clf, X_train)
shap_values2 = explainer(X_train)
using my XGBoost data, to make a dataframe of feature_names and their SHAP importance, as they would appear in a SHAP bar or summary plot.
Following advice from how to extract the most important feature names? and How to get feature names of shap_values from TreeExplainer? specifically the comment by user Thoo, which shows how the values can be extracted to make a dataframe:
vals = np.abs(shap_values).mean(0)
feature_importance = pd.DataFrame(list(zip(X_train.columns, vals)),
                                  columns=['col_name', 'feature_importance_vals'])
feature_importance.sort_values(by=['feature_importance_vals'], ascending=False, inplace=True)
feature_importance.head()
shap_values has 11595 persons with 595 features each, which I understand is large, but creating the vals variable runs very slowly, about 58 minutes on my laptop. It uses almost all of the RAM on the computer.
After 58 minutes I get an error:
Command terminated by signal 9
which, as far as I understand, means that the computer ran out of RAM.
I've tried converting the 2nd line in Thoo's code to
feature_importance = pd.DataFrame(list(zip(X_train.columns,np.abs(shap_values2).mean(0))),columns=['col_name','feature_importance_vals'])
so that vals isn't stored, but this change doesn't reduce RAM usage at all.
I've also tried a different comment from the same GitHub issue (user "ba1mn"):
def global_shap_importance(model, X):
    """ Return a dataframe containing the features sorted by Shap importance
    Parameters
    ----------
    model : The tree-based model
    X : pd.Dataframe
        training set/test set/the whole dataset ... (without the label)
    Returns
    -------
    pd.Dataframe
        A dataframe containing the features sorted by Shap importance
    """
    explainer = shap.Explainer(model)
    shap_values = explainer(X)
    cohorts = {"": shap_values}
    cohort_labels = list(cohorts.keys())
    cohort_exps = list(cohorts.values())
    for i in range(len(cohort_exps)):
        if len(cohort_exps[i].shape) == 2:
            cohort_exps[i] = cohort_exps[i].abs.mean(0)
    features = cohort_exps[0].data
    feature_names = cohort_exps[0].feature_names
    values = np.array([cohort_exps[i].values for i in range(len(cohort_exps))])
    feature_importance = pd.DataFrame(
        list(zip(feature_names, sum(values))), columns=['features', 'importance'])
    feature_importance.sort_values(
        by=['importance'], ascending=False, inplace=True)
    return feature_importance
but global_shap_importance returns the feature importances in the wrong order, and I don't see how I can alter global_shap_importance so that the features are returned in the same order as summary_plot (beeswarm plot).
How can I get the feature importance ranking into a dataframe?
I pulled this straight from the source code. Confirmed identical to the summary_plot.
def shapley_feature_ranking(shap_values, X):
    feature_order = np.argsort(np.mean(np.abs(shap_values), axis=0))
    return pd.DataFrame(
        {
            "features": [X.columns[i] for i in feature_order][::-1],
            "importance": [
                np.mean(np.abs(shap_values), axis=0)[i] for i in feature_order
            ][::-1],
        }
    )

shapley_feature_ranking(shap_values[0], X)
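A hedged usage note: explainer(X_train) returns a shap.Explanation object, so passing its .values array keeps the numpy operations above working on a plain ndarray (variable names taken from the question):
# shap_values2 is the Explanation object from the question; .values is the underlying array
ranking = shapley_feature_ranking(shap_values2.values, X_train)
print(ranking.head())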

Is there a way to add a 'sentiment' column after applying CountVectorizer or TfIdfTransformer to a dataframe?

I am working with app store reviews to classify them as class "0" or class "1" based on the text in the review and the sentiment the review carries.
In my classification steps I apply the following methods to my dataframe:
def get_sentiment(s):
    vs = analyzer.polarity_scores(s)
    if vs['compound'] >= 0.5:
        return 1
    elif vs['compound'] <= -0.5:
        return -1
    else:
        return 0

df['sentiment'] = df['review'].apply(get_sentiment)
For simplicity's sake, the data has already been labeled as either class '0' or '1', but I am training the model to classify new instances that have not been labeled yet. In short, the data I'm working with is already labeled; the labels are in the classification column.
Then in my train test split method do the following:
msg_train, msg_test, label_train, label_test = train_test_split(df.drop('classification', axis=1), df['classification'], test_size=0.3, random_state=42)
So the dataframe for the X parameter has review and sentiment, and for the y parameter I only have the classification that I am training my model on.
Since the normalization is repetitive, I am running a pipeline like so for simplicity:
pipeline1 = Pipeline([
    ('bow', CountVectorizer(analyzer=clean_review)),
    ('tfidf', TfidfTransformer()),
    ('classifier', MultinomialNB())
])
Where the clean_review function is as follows:
def clean_review(sentence):
    no_punc = [c for c in sentence if c not in string.punctuation]
    no_punc = ''.join(no_punc)
    no_stopwords = [w.lower() for w in no_punc.split() if w not in stopwords_set]
    stemmed_words = [ps.stem(w) for w in no_stopwords]
    return stemmed_words
Where stopwords_set is the collection of English stopwords from the nltk library, and ps is a PorterStemmer from the nltk library (for word stemming).
I get the following error: ValueError: Found input variables with inconsistent numbers of samples: [2, 505]
When I searched for this error, I saw that the likely issue could be a mismatch in the number of records for each attribute. I've found this not to be the case: all the records that I am using have values for every column.
Can someone help me interpret what this error could mean?
My end goal is to have a dataframe that has the CountVectorizer and TfIdfTransformer applied to the text, but also retain the column for the sentiment of each review.
I would then like to be able to train the MultinomialNB classifier on this dataframe and apply this model to other tasks.
I'm not sure what the error is due to, since I don't know what the size of your dataframe should be; I would need more information. On which line is the error thrown?
Regarding the fact that you want to retain the sentiment column, you could apply CountVectorizer and TfIdfTransformer (by the way, you could skip a step and directly apply TfidfVectorizer) only to the text data, and then have another transformer in the pipeline that adds the original sentiment column back before you feed the dataframe to the classifier.
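A minimal sketch of that suggestion using ColumnTransformer (the column names "review"/"sentiment" and the 0/1/2 shift are assumptions based on the question, not a tested solution):
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

preprocess = ColumnTransformer(
    transformers=[
        # a single column name (not a list) hands TfidfVectorizer the 1-D text column it expects
        ('tfidf', TfidfVectorizer(analyzer=clean_review), 'review'),
    ],
    remainder='passthrough')  # keeps the sentiment column alongside the tf-idf features

pipeline2 = Pipeline([
    ('features', preprocess),
    ('classifier', MultinomialNB()),
])

# MultinomialNB requires non-negative inputs, so shift sentiment from {-1, 0, 1} to {0, 1, 2} first
msg_train = msg_train.assign(sentiment=msg_train['sentiment'] + 1)
pipeline2.fit(msg_train, label_train)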

Getting results of H2O neural nets from: h2o.grid.grid_search.H2OGridSearch

I have been training a neural net with hyperparameters but am unable to get results out, as I am getting the following error message:
nn
Error message: 'int' object is not iterable
Code:
nn = H2OGridSearch(model=H2ODeepLearningEstimator,
                   hyper_params={
                       'activation': ["Rectifier", "Tanh", "Maxout", "RectifierWithDropout", "TanhWithDropout", "MaxoutWithDropout"],
                       'hidden': [[20,20], [50,50], [30,30,30], [25,25,25,25]],  ## small network, runs faster
                       'epochs': 1000000,  ## hopefully converges earlier...
                       'rate': [0.0005, 0.001, 0.0015, 0.002, 0.0025, 0.003, 0.0035, 0.0040, 0.0045, 0.005],
                       'score_validation_samples': 10000,  ## sample the validation dataset (faster)
                       'stopping_rounds': 2,
                       'stopping_metric': "misclassification",  ## alternatives: "MSE","logloss","r2"
                       'stopping_tolerance': 0.01})
nn.train(train1_x, train1_y, train1)
There is a slight problem with how you are defining the grid. You can only pass a dictionary of lists (of values to grid over for each hyperparameter) in the hyper_params argument. The reason you are seeing Error message: 'int' object is not iterable is that you are trying to pass an integer instead of a list for both score_validation_samples and stopping_rounds.
If there are arguments that you don't intend to grid over, they should instead be passed to the grid's train() method. I'd also recommend using a validation frame or cross-validation when doing a grid search, so you don't have to use training metrics to choose the best model. See the example below.
import h2o
from h2o.estimators.deeplearning import H2ODeepLearningEstimator
from h2o.grid.grid_search import H2OGridSearch
h2o.init()
# Import a sample binary outcome training set into H2O
train = h2o.import_file("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
# Identify predictors and response
x = train.columns
y = "response"
x.remove(y)
# For binary classification, response should be a factor
train[y] = train[y].asfactor()
# Execute a grid search (also do 5-fold CV)
grid = H2OGridSearch(model=H2ODeepLearningEstimator, hyper_params={
    'activation': ["Rectifier", "Tanh", "Maxout", "RectifierWithDropout", "TanhWithDropout", "MaxoutWithDropout"],
    'hidden': [[20,20], [50,50], [30,30,30], [25,25,25,25]]})
grid.train(x=x, y=y, training_frame=train,
           score_validation_samples=10000,
           stopping_rounds=2,
           stopping_metric="misclassification",
           stopping_tolerance=0.01,
           nfolds=5)
# Look at grid results
gridperf = grid.get_grid(sort_by='mean_per_class_error')
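As a hedged follow-up, the first model in the sorted grid is the best one under that metric, and its cross-validated metrics can be inspected directly:
# the grid returned by get_grid() is sorted, so models[0] is the best model
best_model = gridperf.models[0]
print(best_model.model_performance(xval=True))  # cross-validated performance, since nfolds=5 was used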
There are more examples of how to use grid search in the H2O Python Grid Search tutorial.
