Spark : setNumClasses() for a subset of labels for Multiclass LogisticRegressionModel - apache-spark

I have a database with ids (labels) that range from 1 to 1040. I am using multiclass logistic regression to predict the id. Now, if I want to train on only a subset of labels, say from 800 to 810, I get an error when I set setNumClasses(11) for 11 classes. I must always set this method to the maximum label value, which is 1040. That way the model trains for all labels from 0 to 1040, which is very expensive and uses a lot of resources.
Am I understanding this right? How can I train my model on only a subset of labels while passing setNumClasses(count_of_classes)?
final LogisticRegressionModel model = new LogisticRegressionWithLBFGS()
.setNumClasses(811).run(train.rdd());

Based on the comments on the previous answer, the second-to-last comment is the main query. Setting setNumClasses(23) means that all classes in the training set must be in the range 0 to 22. Check the docs; they say:
:: Experimental :: Set the number of possible outcomes for k classes classification problem in Multinomial Logistic Regression. By default, it is binary logistic regression so k will be set to 2.
That means that for binary logistic regression the classes are 0 and 1, so setNumClasses(2) is the default.
If the training set contains other classes such as 2, 3, or 4, binary classification will not work.
Proposed solution: if your training set (or a subset of it) contains the classes 790 - 801 and 900 - 910, remap those labels to the range 0 to 22 and call setNumClasses(23); a sketch of the remapping follows below.
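A minimal PySpark sketch of that remapping idea (the question uses the Java API, but the approach is the same; labeled_rdd is a hypothetical RDD of LabeledPoint):

from pyspark.mllib.classification import LogisticRegressionWithLBFGS
from pyspark.mllib.regression import LabeledPoint

# keep only the labels of interest (here 790-801 and 900-910, i.e. 23 classes)
wanted = list(range(790, 802)) + list(range(900, 911))
subset = labeled_rdd.filter(lambda p: p.label in wanted)

# build a dense mapping original_label -> 0..22 and relabel the points
label_to_index = {label: idx for idx, label in enumerate(sorted(wanted))}
remapped = subset.map(lambda p: LabeledPoint(label_to_index[p.label], p.features))

# numClasses now matches the remapped label range
model = LogisticRegressionWithLBFGS.train(remapped, numClasses=len(wanted))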

You cannot do it like this. You are supplying a set of training data, and it probably fails somewhere in the gradient descent step in Spark (hard to tell, since you haven't provided the error message).
Also, how is Spark supposed to figure out which labels it should train the model for?
What you should do is filter out only the rows in the RDD with the labels for which you want to train the model. For instance, let's say your labels are values from 0 to 1040 and you only want to train for labels 0 to 800; you can do:
val actualTrainingRDD = train.filter( _.label < 801 )
val model = new LogisticRegressionWithLBFGS()
  .setNumClasses(801)
  .run(actualTrainingRDD)
Edit: yes, it's of course possible to choose a different set of labels; that was just an example. Simply change the filter to:
train.filter( row => (row.label >= 790 && row.label < 801) )
This is Scala; Java lambdas use ->, right?

Related

Should I standardize the second dataset with the same scaling as in the first dataset?

I am very confused.
I have two datasets. One dataset is considered the source domain (Dataset A) and the other is considered the target domain (Dataset B).
First, I standardized each column of Dataset A using the mean and standard deviation of the respective column. I have 600 points in Dataset A. Then I split the dataset into training, validation and testing sets. I trained a CNN model and then tested it on the testing set. It gives pretty accurate results (predictions).
I calculated the mean and standard deviation of each column in Dataset A as follows:
thicknessMean = np.mean(thick_SD)
MaxForceMean = np.mean(maxF_SD)
MeanForceMean = np.mean(meanF_SD)
thicknessstd = np.std(thick_SD)
MaxForcestd = np.std(maxF_SD)
MeanForcestd = np.std(meanF_SD)
thick_SD_scaled = (thick_SD - thicknessMean)/thicknessstd
maxF_SD_scaled = (maxF_SD - MaxForceMean)/MaxForcestd
meanF_SD_scaled = (meanF_SD - MeanForceMean)/MeanForcestd
Now I want to make predictions from the model by feeding it Dataset B. Therefore, I saved the already trained model (as a .pth file). Then I standardized Dataset B, but this time I transformed it using the mean and standard deviation of Dataset A. After doing this, I evaluated the trained model on Dataset B, but it gives much worse predictions.
thick_TD_scaled = (thick_TD - thicknessMean)/thicknessstd
maxF_TD_scaled = (maxF_TD - MaxForceMean)/MaxForcestd
meanF_TD_scaled = (meanF_TD - MeanForceMean)/MeanForcestd
As you can see, to scale Dataset B I used the mean (e.g. thicknessMean) and standard deviation (e.g. thicknessstd) of Dataset A.
My question is:
(1) Where am I going wrong? What should I do to get predictions that are close to accurate?
(2) When I check prediction accuracy on two different datasets, should I standardize the second dataset with the same scaling as the first dataset?
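For reference, the same idea expressed with sklearn's StandardScaler (a sketch only; dataset_A and dataset_B stand for the feature arrays of the two datasets):

from sklearn.preprocessing import StandardScaler

# fit the scaler on the source domain (Dataset A) only
scaler = StandardScaler().fit(dataset_A)

# transform both domains with Dataset A's mean/std,
# exactly as the manual code above does
A_scaled = scaler.transform(dataset_A)
B_scaled = scaler.transform(dataset_B)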

Hardcode a spark logistic regression model

I've trained a model using PySpark and would like to compare its performance to that of an existing heuristic.
I just want to hardcode an LR model with the coefficients 0.1, 0.5, and 0.7, call .transform on the test data to get the predictions, and compute the accuracies.
How do I hardcode a model?
Unfortunately it's not possible to just set the coefficients of a pyspark LR model. The pyspark LR model is actually a wrapper around a java ml model (see class JavaEstimator).
So when the LR model is fit, it transfers the params from the paramMap to a new java estimator, which is fit to the data. All the LogisticRegressionModel methods/attributes are just calls to the java model using the _call_java method.
Since the coefficients aren't params (you can see a comprehensive list using explainParams on a LR instance), you can't pass them to the java LR model that's created, and there is no setter method.
For example, for a logistic regression model lmr, you can see that the only setters are for the params you can set when you instantiate a pyspark LR instance: lowerBoundsOnCoefficients and upperBoundsOnCoefficients.
print([c for c in lmr._java_obj.__dir__() if "coefficient" in c.lower()])
# >>> ['coefficientMatrix', 'lowerBoundsOnCoefficients',
# 'org$apache$spark$ml$classification$LogisticRegressionParams$_setter_$lowerBoundsOnCoefficients_$eq',
# 'getLowerBoundsOnCoefficients',
# 'org$apache$spark$ml$classification$LogisticRegressionParams$_setter_$upperBoundsOnCoefficients_$eq',
# 'getUpperBoundsOnCoefficients', 'upperBoundsOnCoefficients', 'coefficients',
# 'org$apache$spark$ml$classification$LogisticRegressionModel$$_coefficients']
Trying to set the "coefficients" attribute yields this:
print(lmr.coefficients)
# >>> DenseVector([18.9303, -18.9303])
lmr.coefficients = [10, -10]
# >>> AttributeError: can't set attribute
So you'd have to roll your own pyspark transformer if you want to be able to provide coefficients. It would probably be easier just to calculate results using the standard logistic function as per #pault's comment.
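For instance, here is a sketch of that manual route with NumPy, assuming the fixed coefficients 0.1, 0.5 and 0.7, no intercept, and a feature matrix X_test plus labels y_test collected from the test data (all of these names are illustrative):

import numpy as np

coeffs = np.array([0.1, 0.5, 0.7])

def predict_proba(X):
    # standard logistic function applied to the linear score
    return 1.0 / (1.0 + np.exp(-X.dot(coeffs)))

probs = predict_proba(X_test)
preds = (probs >= 0.5).astype(int)
accuracy = (preds == y_test).mean()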
You can set lower and upper bounds on the coefficients of a LR model.
In your case, when you know exactly what you want, you can set the lower and upper bounds to the same numbers, and the fitted model will have exactly those coefficients.
You can set the coefficients as a dense matrix like this:
from pyspark.ml.linalg import Vectors, Matrices
a = Matrices.dense(1, 3, [0.1, 0.5, 0.7])
b = Matrices.dense(1, 3, [0.1, 0.5, 0.7])
and incorporate them into the model as hyperparameters:
from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression(featuresCol='features', labelCol='label', maxIter=10,
                        lowerBoundsOnCoefficients=a,
                        upperBoundsOnCoefficients=b,
                        threshold=0.5)
And voilà, you have your model.
You can then call fit and transform on it:
best_mod=lr.fit(train)
predict_train=best_mod.transform(train) # train data
predict_test=best_mod.transform(test) # test data
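If the bounds behave as described, the fitted model's coefficients should simply equal the bounds you supplied; a quick (hypothetical) check:

# the fitted coefficients should match the bounds passed in above
print(best_mod.coefficientMatrix)   # expected: a 1x3 matrix with 0.1, 0.5, 0.7
print(best_mod.interceptVector)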

How to forecast with a multiple linear regression model?

I have air pollution data with 3 parameters. How can I predict values up to 2040 with a multiple linear regression model in R?
explanatory_data = data.frame(par1 = data$par1, par2 = data$par2)
predictions = predict.lm(model, explanatory_data)
Add more parameters to the data frame if you need them.
This is how predictions can be made with multiple regression models.

How can I correctly use Pipeline with MinMaxScaler + NMF to predict data?

This is a very small sklearn snippet:
from sklearn import decomposition, linear_model
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

logistic = linear_model.LogisticRegression()
pipe = Pipeline(steps=[
    ('scaler_2', MinMaxScaler()),
    ('pca', decomposition.NMF(6)),
    ('logistic', logistic),
])
from sklearn.cross_validation import train_test_split
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2)
pipe.fit(Xtrain, ytrain)
ypred = pipe.predict(Xtest)
I will get this error:
raise ValueError("Negative values in data passed to %s" % whom)
ValueError: Negative values in data passed to NMF (input X)
According to this question:
Scaling test data to 0 and 1 using MinMaxScaler
I know this is because
This is due to the fact that the lowest value in my test data was lower than in the train data, on which the min max scaler was fit
But I am wondering, is this a bug?
It seems MinMaxScaler (like all scalers) should be applied before I do the prediction; it should not depend on previously fitted training data, am I right?
Or how could I correctly use preprocessing scalers with Pipeline?
Thanks.
This is not a bug. The main reason you add the scaler to the pipeline is to prevent leaking information from your test set into your model. When you fit the pipeline to your training data, the MinMaxScaler keeps the min and max of your training data, and it will use these values to scale any other data it sees at prediction time. As you also highlighted, this min and max are not necessarily the min and max of your test data set! Therefore you may end up with some negative values in your scaled test set when the min of your test set is smaller than the min of your training set. You need a scaler that does not give you negative values. For instance, you may use sklearn.preprocessing.StandardScaler. Make sure that you set the parameter with_mean=False. This way, it will not center the data before scaling, but it will scale your data to unit variance.
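As a sketch of that suggestion, keeping everything else from the question's pipeline unchanged (note this only avoids negative values if the raw features are non-negative to begin with, since the scaler merely divides each feature by its standard deviation):

from sklearn.preprocessing import StandardScaler

pipe = Pipeline(steps=[
    ('scaler_2', StandardScaler(with_mean=False)),  # unit variance, no centering
    ('pca', decomposition.NMF(6)),
    ('logistic', logistic),
])
pipe.fit(Xtrain, ytrain)
ypred = pipe.predict(Xtest)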
If your data is stationary and sampling is done properly, you can assume that your test set resembles your train set to a large extent.
Therefore, you can expect the min/max over the test set to be close to the min/max over the train set, with the exception of a few "outliers".
To decrease the chance of producing negative values with MinMaxScaler on the test set, simply scale your data not to the (0, 1) range, but leave some "safety space" for your transformer, like this:
MinMaxScaler(feature_range=(1,2))
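A tiny illustration of that safety margin, with made-up numbers: a test value slightly below the training minimum still maps to a positive number when the target range is (1, 2):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

train = np.array([[1.0], [2.0], [10.0]])
test = np.array([[0.5]])            # below the training minimum

scaler = MinMaxScaler(feature_range=(1, 2)).fit(train)
print(scaler.transform(test))       # about 0.94 -- still positive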

Spark: Normalising/Standardizing the test set using training set statistics

This is a very common process in Machine Learning.
I have a dataset and I split it into a training set and a test set.
Since I apply normalization and standardization to the training set, I would like to use the same statistics from the training set (the mean/std/min/max values of each feature) to normalize and standardize the test set too. Do you know an optimal way to do that?
I am aware of the functions MinMaxScaler, StandardScaler etc.
You can achieve this via a few lines of code on both the training and test set.
On the training side there are two approaches:
MultivariateStatisticalSummary
http://spark.apache.org/docs/latest/mllib-statistics.html
val summary: MultivariateStatisticalSummary = Statistics.colStats(observations)
println(summary.mean) // a dense vector containing the mean value for each column
println(summary.variance) // column-wise variance
println(summary.numNonzeros) // number of nonzeros in each column
Using SQL
from pyspark.sql.functions import mean, min, max
df.select([mean('uniform'), min('uniform'), max('uniform')]).show()
+------------------+-------------------+------------------+
| AVG(uniform)| MIN(uniform)| MAX(uniform)|
+------------------+-------------------+------------------+
|0.5215336029384192|0.19657711634539565|0.9970412477032209|
+------------------+-------------------+------------------+
On the testing data you can then manually normalize the data using the statistics obtained above from the training data. You can decide in which sense you wish to normalize, e.g.
Student's T
val normalized = testData.map{ m =>
(m - trainMean) / trainingSampleStddev
}
Feature Scaling
val normalized = testData.map{ m =>
(m - trainMean) / (trainMax - trainMin)
}
There are others: take a look at https://en.wikipedia.org/wiki/Normalization_(statistics)
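Alternatively, if you work with DataFrames and the ML Pipeline API, a sketch of the same idea using pyspark.ml's StandardScaler (train_df and test_df are assumed to already have an assembled "features" column):

from pyspark.ml.feature import StandardScaler

scaler = StandardScaler(inputCol="features", outputCol="scaled_features",
                        withMean=True, withStd=True)

# fit on the training set only, then apply the same statistics to both sets
scaler_model = scaler.fit(train_df)
train_scaled = scaler_model.transform(train_df)
test_scaled = scaler_model.transform(test_df)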
