How to forecast with a multiple linear regression model? - statistics

I have air pollution data with 3 parameters. How can I predict values up to 2040 with a multiple linear regression model in R?

explanatory_data = data.frame(par1 = data$par1, par2 = data$par2)
predictions = predict(model, newdata = explanatory_data)
Add more parameters to the data frame if you need them. This is how predictions are made for multiple regressions. Note that to forecast up to 2040, explanatory_data must contain the future values of the predictors (one row per time point you want to predict); the regression cannot extrapolate the predictors for you.
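If it helps to see the same predict-on-future-rows pattern outside R, here is a minimal Python sketch using statsmodels (historical_df, future_par1, and future_par2 are hypothetical stand-ins for your data):
import pandas as pd
import statsmodels.formula.api as smf
# fit on historical data; the column names (pollution, par1, par2) are hypothetical
model = smf.ols("pollution ~ par1 + par2", data=historical_df).fit()
# to forecast out to 2040 you must supply FUTURE predictor values;
# the regression cannot invent them for you
future_df = pd.DataFrame({"par1": future_par1, "par2": future_par2})
predictions = model.predict(future_df)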

Related

Should I standardize the second dataset with the same scaling as the first dataset?

I am very confused.
I have two datasets. One is considered the source domain (Dataset A) and the other the target domain (Dataset B).
First, I standardized each column of Dataset A using the mean and standard deviation of the respective column. I have 600 points in Dataset A. Then I split the dataset into training, validation, and testing sets. I trained a CNN model and then tested it on the testing set. It gives pretty accurate results (predictions).
I have calculated the mean and standard deviation of each column in Dataset A as follows:
# column-wise statistics of Dataset A (source domain)
thicknessMean = np.mean(thick_SD)
MaxForceMean = np.mean(maxF_SD)
MeanForceMean = np.mean(meanF_SD)
thicknessstd = np.std(thick_SD)
MaxForcestd = np.std(maxF_SD)
MeanForcestd = np.std(meanF_SD)
# standardize Dataset A with its own statistics
thick_SD_scaled = (thick_SD - thicknessMean)/thicknessstd
maxF_SD_scaled = (maxF_SD - MaxForceMean)/MaxForcestd
meanF_SD_scaled = (meanF_SD - MeanForceMean)/MeanForcestd
Now I want to make predictions from the model by feeding it Dataset B, so I saved the already trained model (as a .pth file). Then I standardized Dataset B, but this time I transformed it using the mean and standard deviation of Dataset A. After doing this, I evaluated the trained model on Dataset B, but it gives much worse predictions.
# Dataset B (target domain) scaled with Dataset A's statistics
thick_TD_scaled = (thick_TD - thicknessMean)/thicknessstd
maxF_TD_scaled = (maxF_TD - MaxForceMean)/MaxForcestd
meanF_TD_scaled = (meanF_TD - MeanForceMean)/MeanForcestd
You can see that to scale Dataset B, I used the mean (e.g. thicknessMean) and standard deviation (e.g. thicknessstd) of Dataset A.
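For reference, this fit-on-A, transform-B step is exactly what scikit-learn's StandardScaler does; a minimal sketch, assuming A and B are hypothetical arrays holding the three feature columns:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler().fit(A)  # learn column means/stds from Dataset A only
A_scaled = scaler.transform(A)    # source domain, used for the train/validation/test splits
B_scaled = scaler.transform(B)    # Dataset B scaled with Dataset A's statistics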
My questions are:
(1) Where am I going wrong? What should I do to make my predictions more accurate?
(2) When I check prediction accuracy on two different datasets, should I standardize the second dataset with the same scaling as the first?

Spark ALS gives the same output

I need to build a small ensemble of PySpark ALS recommender systems, because the factor matrices in ALS are initialized randomly, so different runs should give slightly different results, and averaging them should give more accurate results. So I train the model twice. This gives me two different ALS model objects, but calling recommendForAllUsers() on the different models returns the same recommendation outputs. What is wrong here, and why do I need to restart the script to get different outputs even though I have different ALS model objects?
P.S. The seed parameter for the pseudo-random generator is not set.
def __train_model(ratings):
    """Train the ALS model with the current dataset."""
    logger.info("Training the ALS model...")
    als = ALS(rank=rank, maxIter=iterations, implicitPrefs=True,
              regParam=regularization_parameter,
              userCol="order_id", itemCol="product_id", ratingCol="count")
    model = als.fit(ratings)
    logger.info("ALS model built!")
    return model
model1 = __train_model(ratings_DF)
print(model1)
sim_table_1 = model1.recommendForAllUsers(100).toPandas()
model2 = __train_model(ratings_DF)
print(model2)
sim_table_2 = model2.recommendForAllUsers(100).toPandas()
print('Equality of objects:', model1 == model2)
Output:
INFO:__main__:Training the ALS model...
INFO:__main__:ALS model built!
ALS_444a9e62eb6938248b4c
INFO:__main__:Training the ALS model...
INFO:__main__:ALS model built!
ALS_465c95728272696c6c67
Equality of objects: False
If you don't provide a value for the seed parameter when instantiating an ALS instance, it defaults to the same value every time, since it is a hash of the class name string ("ALS"). That's why your recommendations are always the same.
Code for setting default of seed:
self._setDefault(seed=hash(type(self).__name__))
Example:
from pyspark.ml.recommendation import ALS
als1 = ALS(rank=10, maxIter=5)
als2 = ALS(rank=10, maxIter=5)
als1.getSeed() == als2.getSeed() == hash("ALS")
>>> True
If you want to get a different model every time, you can use something like numpy.random.randint to generate a random integer for the seed.
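A minimal sketch of that fix, reusing the parameter names from the question (rank, iterations, and regularization_parameter are assumed to be defined as in your script):
import numpy as np
from pyspark.ml.recommendation import ALS
als = ALS(rank=rank, maxIter=iterations, implicitPrefs=True,
          regParam=regularization_parameter,
          userCol="order_id", itemCol="product_id", ratingCol="count",
          seed=int(np.random.randint(0, 2**31 - 1)))  # fresh seed on every run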

Spark/PySpark: SVM - How to get the area under the curve?

I have been dealing with random forest and naive Bayes lately. Now I want to use a support vector machine.
After fitting the model I wanted to use the output columns "probability" and "label" to compute the AUC value. But now I have seen that there is no "probability" column for SVM?!
Here you can see what I have done so far:
from pyspark.ml.classification import LinearSVC
from pyspark.mllib.evaluation import BinaryClassificationMetrics

svm = LinearSVC(maxIter=5, regParam=0.01)
model = svm.fit(train)
scores = model.transform(train)
results = scores.select('probability', 'label')
# Create Score-Label Set for 'BinaryClassificationMetrics'
results_collect = results.collect()
results_list = [(float(i[0][0]), 1.0 - float(i[1])) for i in results_collect]
scoreAndLabels = sc.parallelize(results_list)
metrics = BinaryClassificationMetrics(scoreAndLabels)
print("AUC-value: " + str(round(metrics.areaUnderROC, 4)))
That was my approach for random forest and naive Bayes in the past. I thought I could do the same with SVM, but that does not work because there is no "probability" output column.
Does anyone know why the "probability" column does not exist, and how I can compute the AUC value now?
Using the most recent Spark/PySpark at the time of this answer: if you use the pyspark.ml module (unlike mllib), you can work with the DataFrame API:
from pyspark.ml.classification import LinearSVC

svm = LinearSVC(maxIter=5, regParam=0.01)
model = svm.fit(train)
test_prediction = model.transform(test)
Create the evaluator (see its source code for settings):
from pyspark.ml.evaluation import BinaryClassificationEvaluator
evaluator = BinaryClassificationEvaluator()
Apply the evaluator to the data (again, the source code shows more options):
evaluation = evaluator.evaluate(test_prediction)
The result of evaluate() is, by default, the area under the ROC curve:
print("evaluation (area under ROC): %f" % evaluation)
The SVM algorithm doesn't provide probability estimates, only scores.
There is an algorithm proposed by Platt to compute probabilities from SVM scores, but it is criticized by some and apparently not implemented in Spark.
By the way, there was a similar question: What does the score of the Spark MLLib SVM output mean?

LDA model prediction inconsistency

I trained an LDA model and loaded it into the environment to transform new data:
from pyspark.ml.clustering import LocalLDAModel
lda = LocalLDAModel.load(path)
df = lda.transform(text)
The model adds a new column called topicDistribution. In my opinion, this distribution should be the same for the same input; otherwise the model is not consistent. However, in practice it is not.
May I ask why this happens and how to fix it?
LDA uses randomness when training and, depending on the implementation, when inferring on new data. The implementation in Spark is based on EM MAP inference, so I believe it only uses randomness when training the model. This means that the results will be different each time the algorithm is trained and run.
To get the same results when running on the same input with the same parameters, you can set the random seed when training the model. For example, with the RDD-based mllib API, to set the random seed to 1:
model = LDA.train(data, k=2, seed=1)
To set the seed when transforming new data, pass a parameter map that overrides the default value (None) for seed:
lda = LocalLDAModel.load(path)
paramMap = {lda.seed: 1}
df = lda.transform(text, paramMap)
For more information about overwriting model parameters, see here.
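Note that the training line above uses the RDD-based mllib API; with the DataFrame-based pyspark.ml API that LocalLDAModel belongs to, the seed is set on the estimator instead. A minimal sketch, where dataset stands in for a DataFrame with a vector "features" column:
from pyspark.ml.clustering import LDA
lda_estimator = LDA(k=2, seed=1)  # same k and seed as the mllib example above
model = lda_estimator.fit(dataset)  # dataset: hypothetical DataFrame of term-count vectors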

Spark: setNumClasses() for a subset of labels for Multiclass LogisticRegressionModel

I have a database with ids (labels) that range from 1 to 1040. I am using multiclass logistic regression to predict the id. Now suppose I want to train on only a subset of the labels, say 800 to 810. I get an error when I set setNumClasses(11) - for 11 classes. I must always set this method to the maximum label value, which is 1040. That way the model trains for all labels from 0 to 1040, which is very expensive and uses a lot of resources.
Am I understanding this right? How can I train my model on only a subset of labels while passing setNumClasses(count_of_classes)?
final LogisticRegressionModel model = new LogisticRegressionWithLBFGS()
    .setNumClasses(811).run(train.rdd());
Based on the comments on the previous answer, the second-to-last comment is the main query. Setting setNumClasses(23) means that in the training set all classes must be in the range 0 to 22. Check the docs. It is written as:
:: Experimental :: Set the number of possible outcomes for k classes classification problem in Multinomial Logistic Regression. By default, it is binary logistic regression so k will be set to 2.
That means for binary logistic regression the classes are 0 and 1, so setNumClasses(2) is the default.
If the training set contains other classes, like 2, 3, 4, binary classification will not work.
Proposed solution: if your training set or subset contains classes 790 - 801 and 900 - 910, then normalize or transform your labels to the range 0 to 22 (12 + 11 = 23 distinct labels) and call setNumClasses(23), as sketched below.
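A sketch of that remapping with the RDD-based API (shown in PySpark for brevity; train_rdd is a hypothetical RDD of LabeledPoint):
from pyspark.mllib.classification import LogisticRegressionWithLBFGS
from pyspark.mllib.regression import LabeledPoint
# index the distinct labels present in the subset as 0..k-1
labels = sorted(train_rdd.map(lambda p: p.label).distinct().collect())
index = {old: float(new) for new, old in enumerate(labels)}
# relabel every point, then train with numClasses = k
remapped = train_rdd.map(lambda p: LabeledPoint(index[p.label], p.features))
model = LogisticRegressionWithLBFGS.train(remapped, numClasses=len(labels))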
You cannot do it like this; you are supplying the full set of training data, and it probably fails somewhere in the gradient descent method in Spark (not sure, since you haven't provided the error message).
Also, how is Spark supposed to figure out which 800 labels it should train the model for?
What you should do is filter out only the rows in the RDD with the labels for which you want to train the model. For instance, let's say your labels are values from 0 to 1040 and you only want to train for labels 0 to 800; you can do:
val actualTrainingRDD = train.filter( _.label < 801 )
final LogisticRegressionModel model = new LogisticRegressionWithLBFGS()
    .setNumClasses(801).run(actualTrainingRDD.rdd());
#Edit: yes, of course it's possible to choose a different set of labels; that was just an example. Simply change the filter method to:
train.filter( row => (row.label >= 790 && row.label < 801) )
This is Scala; Java lambdas use ->, right?
