I am running into a situation where I have no clue what's going on with the PySpark Random Forest classifier. I want the model to be reproducible given the same training data. To do so, I set the seed parameter to an integer value, as recommended on this page:
https://spark.apache.org/docs/2.4.1/api/java/org/apache/spark/mllib/tree/RandomForest.html
This seed parameter is the random seed for bootstrapping and choosing feature subsets. I verified that training twice on the same data produces absolutely identical models. But here's the question.
If I reorder the training data, or simply shuffle it, and run the training process again (with the same seed value), it produces a different model. Can anyone help me understand this behavior? I thought the seed is used for bootstrapping and choosing feature subsets. If that's the case, what is causing this random behavior?
It would be really good to understand this, and if anyone out there can help, it would be much appreciated. Thanks.
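For reference, here is a minimal sketch of what I'm doing (train_rdd is a placeholder for my training RDD of LabeledPoints):

from pyspark.mllib.tree import RandomForest

model_a = RandomForest.trainClassifier(train_rdd, numClasses=2, categoricalFeaturesInfo={}, numTrees=10, seed=42)
model_b = RandomForest.trainClassifier(train_rdd, numClasses=2, categoricalFeaturesInfo={}, numTrees=10, seed=42)

# With the same seed and the same input order, the two models match...
print(model_a.toDebugString() == model_b.toDebugString())
# ...but training on a shuffled copy of train_rdd with seed=42 does not.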
Related
When I run machine learning algorithms repeatedly, the accuracy keeps changing. In that case, how do I select the best-fit algorithm for that particular data set?
You should definitely provide more details. It's impossible to suggest anything without knowing the domain, the model architecture, and the hyperparameters.
I guess you are concerned about the accuracy of the model changing between runs. You should set seeds for all randomized components so that the accuracy doesn't change across training runs and you can reproduce your results. For example:
import random, numpy, tensorflow as tf

random.seed(1)
numpy.random.seed(1)
tf.random.set_random_seed(1)  # if using TensorFlow (1.x API)
Let's assume the question concerns the same training data set X, where on every run we compute accuracy by comparing the predicted responses against the dependent values (y) of our test data.
If the accuracy keeps changing every time we run the model, the issue is likely sampling bias: the division into training and test data is different on each run.
When you use the train_test_split function, set the random_state parameter wisely to keep the test data representative of the overall population of the data. For example:
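from sklearn.model_selection import train_test_split

# Fixing random_state makes the split reproducible across runs
# (X and y are placeholders for your features and labels).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)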
I'm new to Keras and I have one question.
To get a reproducible result, I fixed the seed. If the fit function's shuffle parameter is True, is the training data order always the same for all epochs or not?
Thanks in advance.
Yes, if you set the seed correctly, the shuffling is deterministic: the order still changes from epoch to epoch, but with the same seed you get the same sequence of orderings on every run. However, there have been some problems regarding reproducibility when using TensorFlow with multiprocessing; I'm not sure if that has been solved by now.
You can also check out this page in the Keras documentation.
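As a hedged illustration (build_model, x_train, and y_train are placeholders, and the TF 1.x API is assumed), fixing the seeds before each run should yield the same shuffled batch order and therefore the same training history:

import random
import numpy as np
import tensorflow as tf

def train_once():
    # Fix every seed before building the model so weight initialization
    # and shuffling are deterministic.
    random.seed(1)
    np.random.seed(1)
    tf.random.set_random_seed(1)  # tf.random.set_seed in TF 2.x
    model = build_model()  # build_model, x_train, y_train are placeholders
    hist = model.fit(x_train, y_train, epochs=5, shuffle=True, verbose=0)
    return hist.history["loss"]

# Two runs with the same seed should give identical loss curves.
assert train_once() == train_once()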
Is there a way to get the significance level of each coefficient we receive after we fit a logistic regression model on training data?
I was trying to find a way and could not figure it out myself.
I think I may get the significance level of each feature if I run a chi-squared test, but first of all I'm not sure whether I can run the test on all features together, and secondly my data is numeric, so whether it would give the right result remains a question as well.
Right now I am doing the modeling with statsmodels and scikit-learn, but I certainly want to know how I can get these results from PySpark ML or MLlib itself.
If anyone can shed some light on this, it would be helpful.
I use only MLlib. When you train a model, I think you can use the toPMML method to export your model in PMML format (an XML file); then you can parse the XML file to get the feature weights. Here is an example:
https://spark.apache.org/docs/2.0.2/mllib-pmml-model-export.html
Hope that helps.
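As a hedged sketch of the parsing step (the file name, and the assumption that the export produced a standard PMML RegressionModel, are mine), the coefficients can be read back with a small XML walk:

import xml.etree.ElementTree as ET

# PMML regression models list one NumericPredictor per feature, carrying the
# feature name and its coefficient (PMML 4.2 namespace assumed; it may vary).
tree = ET.parse("model.pmml")  # path is a placeholder
for predictor in tree.getroot().iter("{http://www.dmg.org/PMML-4_2}NumericPredictor"):
    print(predictor.get("name"), predictor.get("coefficient"))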
I'm using spark.mllib.classification.{LogisticRegressionModel, LogisticRegressionWithSGD} and spark.mllib.tree.RandomForest for classification. Using these packages I produce classification models, but these models only predict a specific class per instance. In Weka, we can get the exact probability of each instance belonging to each class. How can we do that using these packages?
In LogisticRegressionModel we can set the threshold, so I've created a function that checks the results for each point at different thresholds. But this cannot be done for RandomForest (see How to set cutoff while training the data in Random Forest in Spark).
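For reference, a minimal sketch of the threshold-sweeping approach for the logistic model (train_rdd and test_rdd are placeholder RDDs of LabeledPoints); clearing the threshold makes predict return the raw probability:

from pyspark.mllib.classification import LogisticRegressionWithSGD

model = LogisticRegressionWithSGD.train(train_rdd)
model.clearThreshold()  # predict() now returns the raw probability instead of a class

for cutoff in [0.3, 0.5, 0.7]:
    # Label each point 1.0 if its probability clears the cutoff, then measure accuracy.
    pairs = test_rdd.map(lambda p: (p.label, 1.0 if model.predict(p.features) > cutoff else 0.0))
    accuracy = pairs.filter(lambda pair: pair[0] == pair[1]).count() / float(test_rdd.count())
    print(cutoff, accuracy)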
Unfortunately, with MLlib you can't get the per-instance probabilities for classification models as of version 1.4.1.
There are JIRA issues (SPARK-4362 and SPARK-6885) concerning this exact topic, which are IN PROGRESS as I'm writing this answer. Nevertheless, the issue seems to have been on hold since November 2014:
There is currently no way to get the posterior probability of a prediction with the Naive Bayes model during prediction. This should be made available along with the label.
And here is a note from @sean-owen on the mailing list, on a similar topic regarding the Naive Bayes classification algorithm:
This was recently discussed on this mailing list. You can't get the probabilities out directly now, but you can hack a bit to get the internal data structures of NaiveBayesModel and compute it from there.
MAJOR EDIT: This issue has been resolved with Spark 1.5.0. Please refer to the JIRA issue for more details.
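For anyone on a newer Spark, a hedged sketch using the DataFrame-based spark.ml API, where probabilistic classifiers append a probability column (train_df and test_df, with label and features columns, are placeholders):

from pyspark.ml.classification import RandomForestClassifier

rf = RandomForestClassifier(labelCol="label", featuresCol="features", numTrees=20)
model = rf.fit(train_df)

# transform() adds rawPrediction, probability, and prediction columns.
model.transform(test_df).select("probability", "prediction").show(5)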
I'm noticing that given the same feature table (training data) and feature vector for an SVC, I am getting different results for the predict_proba output.
Is this expected behavior for an SVC or should I be getting consistent results?
Thanks for your help!
I think this is caused by the fact that libsvm calibrates probabilities using cross-validation on random folds of the dataset. In recent versions of scikit-learn (0.14.1+), passing random_state=0 as a constructor parameter should fix the PRNG seed used internally by libsvm. If that does not make the outcome deterministic, please feel free to open a GitHub issue with a minimal reproduction script.
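A minimal sketch of that fix (X_train, y_train, and X_test are placeholders):

from sklearn.svm import SVC

# probability=True enables Platt scaling, which runs internal cross-validation
# on random folds; fixing random_state makes predict_proba deterministic.
clf = SVC(probability=True, random_state=0)
clf.fit(X_train, y_train)
print(clf.predict_proba(X_test))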