Spark MLlib packages NaN weight - apache-spark

I am trying to run the Spark MLlib packages in pyspark with a test machine learning data set. I am splitting the data set into one half for training and one half for testing. Below is the code that builds the model. However, it shows weights of NaN, NaN, ... across all features, and I couldn't figure out why. It works, though, when I standardize the data with the StandardScaler function first.
model = LinearRegressionWithSGD.train(train_data, step = 0.01)
# evaluate model on test data set
valuesAndPreds = test_data.map(lambda p: (p.label, model.predict(p.features)))
Thank you very much for the help.
Below is the code that I used to do the scaling.
from pyspark.mllib.feature import StandardScaler
from pyspark.mllib.regression import LabeledPoint

# Fit the scaler on the feature columns, then transform each collected vector
scaler = StandardScaler(withMean = True, withStd = True).fit(data.map(lambda x: x.features))
feature = [scaler.transform(x) for x in data.map(lambda x: x.features).collect()]
label = data.map(lambda x: x.label).collect()
scaledData = [LabeledPoint(l, f) for l, f in zip(label, feature)]

Try scaling the features
StandardScaler standardizes features by scaling them to unit variance and/or removing the mean, using column summary statistics on the samples in the training set. This is a very common pre-processing step.
Standardization can improve the convergence rate during optimization, and it also prevents features with very large variances from exerting an overly large influence during model training. Since you have some variables that are large numbers (e.g. revenue) and some that are smaller (e.g. number of clients), this should solve your problem.
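For reference, a minimal sketch of how the whole pipeline can stay on RDDs (assuming the data RDD of LabeledPoints from the question), avoiding the collect() round-trip:
from pyspark.mllib.feature import StandardScaler
from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD

features = data.map(lambda x: x.features)
labels = data.map(lambda x: x.label)

# Fit the scaler on the feature columns
scaler = StandardScaler(withMean = True, withStd = True).fit(features)

# transform() accepts an RDD directly, so no collect() is needed
scaled_data = labels.zip(scaler.transform(features)).map(lambda lf: LabeledPoint(lf[0], lf[1]))

train_data, test_data = scaled_data.randomSplit([0.5, 0.5], seed = 42)
model = LinearRegressionWithSGD.train(train_data, step = 0.01)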

Related

What steps should I take next to improve my accuracy? Can data be the problem?

I built various ML models using sklearn for a binary classification problem. The dataset was provided to me by my professor for this comparative study.
My Jupyter notebook and dataset can be found here.
As I am getting very low accuracy, I fear I must be doing something wrong while building the model. So I tested my decision tree on the built-in breast cancer dataset in sklearn, which is very similar to my dataset since both are binary classification problems. There I get a mean accuracy of 95%. So I now think the problem might be my dataset. Can I get some help on how to pre-process my data, or any other steps I might look into to improve accuracy?
Encode labels
Categorical data are variables that contain label values rather than numeric values. The number of possible values is often limited to a fixed set.
For example, users are typically described by country, gender, age group, etc. We will use LabelEncoder to encode the categorical data. LabelEncoder is part of the scikit-learn library in Python and is used to convert categorical (text) data into numbers that our predictive models can better understand.
#Encoding categorical data values
from sklearn.preprocessing import LabelEncoder
labelencoder_Y = LabelEncoder()
Y = labelencoder_Y.fit_transform(Y)
Feature scaling
Most of the time, your dataset will contain features that vary highly in magnitude, units, and range. Since most machine learning algorithms use Euclidean distance between data points in their computations, we need to bring all features to the same level of magnitude. This can be achieved by scaling, which means transforming your data so that it fits within a specific range, like 0–100 or 0–1. We will use the StandardScaler method from the scikit-learn library.
# Feature scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
Choosing the right model
You might also want to choose the appropriate model. You can't just use neural nets (or any single method) for all problems; this is the "no free lunch" theorem. To compare models you could use k-fold cross-validation, AIC, and BIC, as in the sketch below.
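As a rough sketch of what that comparison could look like in scikit-learn (X and y are assumed to be your preprocessed features and labels, and the candidate models are just examples):
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

models = {
    'decision_tree': DecisionTreeClassifier(),
    'logistic_regression': LogisticRegression(max_iter = 1000),
    'random_forest': RandomForestClassifier(),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv = 5)  # 5-fold cross-validation
    print(name, scores.mean(), scores.std())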

How to correctly scale new data points sklearn

Imagine a simple regression problem where you are using gradient descent. For a correct implementation you will need to scale values using the mean of the entire training dataset. Imagine your model is already trained and you feed it another example you wish to predict. How do you scale it correctly with respect to the previous dataset? Do you include the new example in the training set and then scale it with the mean of this training dataset plus the new data points? How should this be done the right way?
By new data points I mean something the model hasn't seen before, neither in training nor testing. How do you handle scaling for anything you pass to regr.predict() if the scaling of the training set is done with respect to the whole set and not a single observation?
Imagine you have an ndarray of features:
to_predict = [10, 12, 1, 330, 1311, 225].
The dataset used for training and testing already oscillates around 0 for every feature. Taking into account the answer below (pseudo code, that's why I'm asking how to do it right):
import numpy as np
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
new_Xs = np.vstack([X_train, to_predict])            # append the new observation
X_train_std_with_new = scaler.fit_transform(new_Xs)  # re-fit on train + new point
scaled_to_predict = X_train_std_with_new[-1]
regr.predict([scaled_to_predict])  # ??
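For contrast, the usual sklearn convention is to fit the scaler once on the training data only and reuse the fitted object for every new observation. A minimal sketch (regr, X_train and y_train are assumed from the snippets above):
import numpy as np
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)  # statistics come from the training data only
regr.fit(X_train_std, y_train)

to_predict = np.array([[10, 12, 1, 330, 1311, 225]])  # shape (1, n_features)
scaled_to_predict = scaler.transform(to_predict)      # reuses the training mean/std
regr.predict(scaled_to_predict)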

Neural network regression with skewed data

I have been trying to build a machine learning model using Keras which predicts the radiation dose based on pre-treatment parameters. My dataset has approximately 2200 samples of which 20% goes into validation and testing.
The problem with the target variable is that it is very skewed, since large radiation doses are much rarer than small ones. Hence, I suspect that my regression model fails to predict the large values at all and predicts everything around the mean, which is apparent from the figure. I have tried log-normalising the target variable to make it more normally distributed, but it has had no effect.
Any suggestions on how to fix this?
[Figure: target variable distribution]
[Figure: regression predictions]
Computing individual sample weights based on 10 histogram bins helped in my case. See the code below:
import numpy as np
import pandas as pd
from sklearn.utils.class_weight import compute_sample_weight

# Bin the continuous targets into 10 histogram bins
hist, bin_edges = np.histogram(training_targets, bins = 10)
# Assign each sample the index of its histogram bin (training_targets is a pandas Series)
classes = pd.cut(training_targets, bin_edges, labels = False, include_lowest = True).values
# Weight each sample inversely to the frequency of its bin
sample_weights = compute_sample_weight('balanced', classes)
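These weights can then be passed straight to the fit call; in Keras that would look something like the line below (X_train and the epoch count are placeholders, not from the original post):
# Hypothetical usage: Keras accepts per-sample weights in fit()
model.fit(X_train, training_targets, epochs = 100, sample_weight = sample_weights)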

Sklearn overfitting

I have a data set containing 1000 points, each with 2 inputs and 1 output. It has been split into 80% for training and 20% for testing. I am training a support vector regressor from sklearn. I get 100% accuracy on the training set, but the results obtained on the test set are not good. I think it may be because of overfitting. Can you suggest something to solve the problem?
You may be right: if your model scores very high on the training data but does poorly on the test data, it is usually a symptom of overfitting. You need to retrain your model with different settings. I assume you are using train_test_split from sklearn, or a similar mechanism which guarantees that your split is fair and random. So, you will need to tweak the hyperparameters of SVR, create several models, and see which one does best on your test data.
If you look at the SVR documentation, you will see that it can be initialized with several input parameters, each of which could be set to a number of different values. For simplicity, let's assume you only want to tweak two parameters, 'kernel' and 'C', while keeping the third parameter 'degree' set to 4. You are considering 'rbf' and 'linear' for the kernel, and 0.1, 1, 10 for C. A simple solution is this:
from sklearn.svm import SVR

for kernel in ('rbf', 'linear'):
    for c in (0.1, 1, 10):
        svr = SVR(kernel=kernel, C=c, degree=4)
        svr.fit(train_features, train_target)
        score = svr.score(test_features, test_target)
        print(kernel, c, score)
This way, you generate 6 models and can see which parameters lead to the best score, which will be the best model to choose, given these parameters.
A simpler way is to let sklearn do most of this work for you, using GridSearchCV (or RandomizedSearchCV):
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

parameters = {'kernel': ('linear', 'rbf'), 'C': (0.1, 1, 10)}
clf = GridSearchCV(SVR(degree=4), parameters)
clf.fit(train_features, train_target)
print(clf.best_score_)
print(clf.best_params_)
model = clf.best_estimator_  # This is your model
I am working on a little tool to simplify using sklearn for small projects; it makes everything a matter of configuring a yaml file and letting the tool do all the work for you. It is available on my github account. You might want to take a look and see if it helps.
Finally, your data may not be linear. In that case you may want to try something like PolynomialFeatures to generate new nonlinear features based on the existing ones and see if it improves your model quality.
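A rough sketch of what that could look like, reusing the names from the snippets above:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import SVR

# Expand the 2 inputs into degree-2 polynomial combinations
poly = PolynomialFeatures(degree = 2, include_bias = False)
train_poly = poly.fit_transform(train_features)
test_poly = poly.transform(test_features)

svr = SVR(kernel = 'linear', C = 1)
svr.fit(train_poly, train_target)
print(svr.score(test_poly, test_target))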
Try fitting your data with sklearn's K-Fold cross-validation on the training split. This gives you a fair split of the data and a better model, though at a cost in training time, which shouldn't really matter for a small dataset where the priority is accuracy.
A few hints:
Since you have only two inputs, it would be great if you plotted your data. Try either a scatter plot with alpha = 0.3 or a heatmap.
Try GridSearchCV, as mentioned by @shahins.
In particular, try different values for the C parameter. As mentioned in the docs, if you have a lot of noisy observations you should decrease it: a smaller C corresponds to stronger regularization.
If it's taking too long, you can also try RandomizedSearchCV.
As a side note on @shahins's answer (I am not allowed to add comments), the two implementations are not equivalent. GridSearchCV is the better choice since it performs cross-validation within the training set to tune the hyperparameters. Do not use the test set for tuning hyperparameters!
Don't forget to scale your data
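One way to combine the two tips above is to put the scaler inside a Pipeline, so that GridSearchCV re-fits the scaler on each training fold and no test statistics leak in. A minimal sketch (train_features and train_target assumed from the snippets above):
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

# Scaling happens inside the pipeline, so each CV fold is scaled independently
pipe = Pipeline([('scale', StandardScaler()), ('svr', SVR())])
parameters = {'svr__kernel': ('linear', 'rbf'), 'svr__C': (0.1, 1, 10)}
clf = GridSearchCV(pipe, parameters)
clf.fit(train_features, train_target)
print(clf.best_params_)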

Spark, MLlib: Adjusting classifier discrimination threshold

I am trying to use the Spark MLlib Logistic Regression (LR) and/or Random Forest (RF) classifiers to create a model that discriminates between two classes represented by sets whose cardinalities differ quite a lot.
One set has 150 000 000 negative instances and the other just 50 000 positive instances.
After training both LR and RF classifiers with default parameters I get very similar results for both classifiers, for example, on the following test set:
Test instances: 26842
Test positives = 433.0
Test negatives = 26409.0
Classifier detects:
truePositives = 0.0
trueNegatives = 26409.0
falsePositives = 433.0
falseNegatives = 0.0
Precision = 0.9838685641904478
Recall = 0.9838685641904478
It looks like the classifier cannot detect any positive instances at all.
Also, no matter how the data is split into train and test sets, the classifier produces exactly the same number of false positives, equal to the number of positives the test set actually contains.
The LR classifier's default threshold is 0.5, and setting the threshold to 0.8 does not make any difference:
val model = new LogisticRegressionWithLBFGS().run(training)
model.setThreshold(0.8)
Questions:
1) Please advise how to manipulate the classifier threshold to make the classifier more sensitive to a class with a tiny fraction of positive instances versus a class with a huge number of negative instances.
2) Are there any other MLlib classifiers that can solve this problem?
3) What does the intercept parameter do in the Logistic Regression algorithm?
val model = new LogisticRegressionWithSGD().setIntercept(true).run(training)
Well, I think what you have here is a very unbalanced data set problem:
150 000 000 Class1
50 000 Class2, i.e. 3000 times smaller.
So if you train a classifier that assumes everything is Class1 you are going to have 0.999666 accuracy, so the "best" classifier is always to predict Class1. This is what your model is learning here.
There are different ways to address these cases. In general you can downsample the larger class or up-sample the smaller class; with random forests, for example, you can sample in a balanced (stratified) way when you draw the samples, or add weights (see the sketch after the links below):
http://statistics.berkeley.edu/sites/default/files/tech-reports/666.pdf
Other methods such as SMOTE, etc. (also sampling-based) exist as well; for more details you can read here:
https://www3.nd.edu/~dial/papers/SPRINGER05.pdf
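As a rough illustration of the downsampling option, a minimal pyspark sketch (positives and negatives are assumed RDDs of LabeledPoint, not from the original post):
# Downsample the negative class so the two classes are roughly balanced
ratio = positives.count() / float(negatives.count())  # roughly 50 000 / 150 000 000
balanced_negatives = negatives.sample(withReplacement = False, fraction = ratio, seed = 42)
balanced_training = positives.union(balanced_negatives)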
The threshold you can change for your logistic regression is the probability threshold; you can try playing with "probabilityCol" in the parameters of the logistic regression example here:
http://spark.apache.org/docs/latest/ml-guide.html
But a problem with MLlib right now is that not all classifiers return a probability; I asked them about this and it is on their roadmap.
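For the RDD-based API, one option I believe works is to clear the model's threshold so that predict() returns raw probabilities, and then apply your own cutoff. A minimal pyspark sketch (training and test are assumed RDDs of LabeledPoint):
from pyspark.mllib.classification import LogisticRegressionWithLBFGS

model = LogisticRegressionWithLBFGS.train(training)
model.clearThreshold()  # predict() now returns the class-1 probability
probs = test.map(lambda p: (p.label, model.predict(p.features)))
# Apply a custom, more sensitive cutoff for the rare positive class
preds = probs.mapValues(lambda prob: 1.0 if prob >= 0.2 else 0.0)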
