Fixing the seed in Keras: effect on training data shuffling

I'm new to Keras and I have one question.
To get reproducible results, I fixed the seed. If the fit function's shuffle parameter is True, is the training data order always the same for all epochs, or not?
Thanks in advance.

Yes, if you set the seed to a fixed value, the training order should always be the same for a given seed. However, there have been some problems regarding reproducibility when using TF with multiprocessing; I'm not sure if that has been solved by now.
You can also check out this page in the Keras documentation.
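For reference, a minimal sketch of what "fixing the seed" typically means in this context; the seed value is an arbitrary example, and tf.random.set_seed assumes TF 2.x (TF 1.x used tf.set_random_seed):
```python
import os
import random
import numpy as np
import tensorflow as tf

SEED = 42  # arbitrary example value

# Fix every generator Keras may draw from, before building/fitting the model.
# PYTHONHASHSEED ideally needs to be set before the Python process starts.
os.environ["PYTHONHASHSEED"] = str(SEED)
random.seed(SEED)
np.random.seed(SEED)
tf.random.set_seed(SEED)  # tf.set_random_seed(SEED) on TF 1.x
```
With these in place, both weight initialization and the shuffling done by fit(..., shuffle=True) should be repeatable across runs, subject to the multiprocessing caveat above.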

Related

Skip some layers in Keras model during "Evaluation/Validation Phase"

Hi Stackoverflow community,
I am solving a problem where I have to skip some layers of a Keras model during the evaluation/validation phase. The output of "Cutom_VGG_Model" should go directly into "Classifier" (the output shape of the former matches the input shape of the latter). This would be a new evaluation strategy; training is performed on the same model without any change. Kindly suggest some ways to solve this.
I found a solution in the Keras documentation where a custom Model is created and the "test_step" method is overridden, but I am not able to formulate the code myself. Below is the link to the documentation I am referring to:
https://keras.io/guides/customizing_what_happens_in_fit/
I am using Tensorflow Keras.
My requirement is to calculate the validation loss with this new, modified evaluation strategy after each epoch. The final evaluation will also be performed with it.
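For reference, a minimal sketch of that test_step override, assuming the model can be split into three hypothetical sub-models (backbone standing in for "Cutom_VGG_Model", middle for the layers to skip, and classifier for "Classifier"):
```python
import tensorflow as tf
from tensorflow import keras

class SkipLayersModel(keras.Model):
    """Trains on the full path, but evaluates backbone -> classifier directly."""

    def __init__(self, backbone, middle, classifier, **kwargs):
        super().__init__(**kwargs)
        self.backbone = backbone      # stand-in for "Cutom_VGG_Model"
        self.middle = middle          # the layers to be skipped at eval time
        self.classifier = classifier  # stand-in for "Classifier"

    def call(self, inputs, training=False):
        # Full path, used by the default train_step during training.
        x = self.backbone(inputs, training=training)
        x = self.middle(x, training=training)
        return self.classifier(x, training=training)

    def test_step(self, data):
        x, y = data
        # Modified evaluation path: backbone output goes straight to the classifier.
        y_pred = self.classifier(self.backbone(x, training=False), training=False)
        self.compiled_loss(y, y_pred, regularization_losses=self.losses)
        self.compiled_metrics.update_state(y, y_pred)
        return {m.name: m.result() for m in self.metrics}
```
If something like this works, passing validation_data to fit should report the per-epoch validation loss with the modified path, and model.evaluate should use it for the final evaluation.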
Thanks in advance!

Need help understanding the MLPClassifier

I've been working with the MLPClassifier for a while and I think I had a wrong interpretation of what it does the whole time. I think I've got it right now, but I am not sure. I will summarize my understanding, and it would be great if you could add your thoughts on the correct interpretation.
So with the MLPClassifier we are building a neural network based on a training dataset. Setting early_stopping = True it is possible to use a validation dataset within the training process in order to check whether the network is working on a new set as well. If early_stopping = False, no validation within the process is done. After one has finished building, we can use the fitted model in order to predict on a third dataset if we wish to.
What I was thinking before is that, during the whole training process, a validation dataset is set aside anyway, with validation after every epoch.
I'm not sure if my question is understandable, but it would be great if you could help me to clear my thoughts.
The sklearn.neural_network.MLPClassifier uses (a variant of) Stochastic Gradient Descent (SGD) by default. Your question could be framed more generally as how SGD is used to optimize the parameter values in a supervised learning context. There is nothing specific to Multi-layer Perceptrons (MLP) here.
So with the MLPClassifier we are building a neural network based on a training dataset. Setting early_stopping = True it is possible to use a validation dataset within the training process
Correct, although it should be noted that this validation set is taken away from the original training set.
in order to check whether the network is working on a new set as well.
Not quite. The point of early stopping is to track the validation score during training and stop training as soon as the validation score stops improving significantly.
If early_stopping = False, no validation within the process is done. After one has finished building, we can use the fitted model in order to predict on a third dataset if we wish to.
Correct.
What I was thinking before is that, during the whole training process, a validation dataset is set aside anyway, with validation after every epoch.
As you probably know by now, this is not so. The division of the learning process into epochs is somewhat arbitrary and has nothing to do with validation.
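For illustration, a minimal sketch of the two modes with synthetic data (the hyperparameter values are arbitrary):
```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, random_state=0)

# early_stopping=True: validation_fraction of the training data is held out
# internally, and training stops once the validation score has not improved
# by at least tol for n_iter_no_change consecutive epochs.
clf = MLPClassifier(hidden_layer_sizes=(50,), early_stopping=True,
                    validation_fraction=0.1, n_iter_no_change=10,
                    max_iter=500, random_state=0)
clf.fit(X, y)
print(clf.n_iter_, clf.best_validation_score_)

# early_stopping=False (the default): no internal validation set; training
# stops when the training loss plateaus or max_iter is reached.
clf_plain = MLPClassifier(hidden_layer_sizes=(50,), max_iter=500, random_state=0)
clf_plain.fit(X, y)
```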

PySpark MLLib Random Forest classifier repeatability issue

I am running into a situation where I have no clue what's going on with the PySpark Random Forest classifier. I want the model to be reproducible given the same training data. To do so, I set the seed parameter to an integer value, as recommended on this page:
https://spark.apache.org/docs/2.4.1/api/java/org/apache/spark/mllib/tree/RandomForest.html
This seed parameter is the random seed for bootstrapping and choosing feature subsets. I verified that models trained on the same data with the same seed are absolutely identical. But here's the question.
If I reorder the training data, or simply shuffle it, and run the training process (with the same seed value), it produces a different model. Can anyone help me understand this behavior? I thought the seed is used for bootstrapping and choosing feature subsets; if that's the case, what is causing this random behavior?
It would be really good to understand this, and any help would be much appreciated. Thanks.
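To make the setup concrete, here is a rough sketch of the kind of repeatability check involved (shown with the DataFrame-based pyspark.ml API; the RDD-based mllib RandomForest.trainClassifier takes a seed argument in the same spirit, and the file path and column names are placeholders):
```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import RandomForestClassifier

spark = SparkSession.builder.getOrCreate()
train = spark.read.parquet("train.parquet")  # placeholder path

rf = RandomForestClassifier(featuresCol="features", labelCol="label",
                            numTrees=50, seed=42)

model_a = rf.fit(train)                   # original row order
model_b = rf.fit(train.orderBy("label"))  # same rows, different order

# With the same row order, repeated runs give identical trees; after
# reordering, the trees differ even though the seed is unchanged.
print(model_a.toDebugString == model_b.toDebugString)
```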

Pruning in Keras

I'm trying to design a neural network using Keras, with priority on prediction performance, and I cannot reduce the number of layers and nodes per layer any further without losing too much accuracy. I have noticed that a very large portion of my weights are effectively zero (>95%). Is there a way to prune dense layers in the hope of reducing prediction time?
Not a dedicated way :(
There's currently no easy (dedicated) way of doing this with Keras.
A discussion is ongoing at https://groups.google.com/forum/#!topic/keras-users/oEecCWayJrM.
You may also be interested in this paper: https://arxiv.org/pdf/1608.04493v1.pdf.
Take a look at Keras Surgeon:
https://github.com/BenWhetton/keras-surgeon
I have not tried it myself, but the documentation claims that it has functions to remove or insert nodes.
Also, after looking at some papers on pruning, it seems that many researchers create a new model with fewer channels (or fewer layers), and then copy the weights from the original model to the new model.
See this dedicated tooling for tf.keras: https://www.tensorflow.org/model_optimization/guide/pruning
As the overview suggests, support for latency improvements is a work in progress.
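A minimal sketch of magnitude pruning with that toolkit (the layer sizes, sparsity target, and step counts are placeholder values):
```python
import numpy as np
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Dummy data just so the sketch runs end to end.
x_train = np.random.rand(256, 20).astype("float32")
y_train = np.random.randint(0, 10, size=(256,))

base_model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# Ramp the sparsity of the wrapped layers from 0% to 95% over the first 1000 steps.
schedule = tfmot.sparsity.keras.PolynomialDecay(
    initial_sparsity=0.0, final_sparsity=0.95, begin_step=0, end_step=1000)
pruned_model = tfmot.sparsity.keras.prune_low_magnitude(
    base_model, pruning_schedule=schedule)

pruned_model.compile(optimizer="adam",
                     loss="sparse_categorical_crossentropy",
                     metrics=["accuracy"])
# UpdatePruningStep is required so the pruning schedule advances during fit.
pruned_model.fit(x_train, y_train, epochs=2,
                 callbacks=[tfmot.sparsity.keras.UpdatePruningStep()])

# strip_pruning removes the pruning wrappers, leaving a standard (sparse) model.
final_model = tfmot.sparsity.keras.strip_pruning(pruned_model)
```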
Edit: Keras -> tf.keras based on LucG's suggestion.
If you set an individual weight to zero, won't that prevent it from being updated during back propagation? Shouldn't that weight remain zero from one epoch to the next? That's why you set the initial weights to nonzero values before training. If you want to "remove" an entire node, just set all of the weights on that node's output to zero, and that will prevent that node from having any effect on the output throughout training.

Can SVC give different results? [scikit-learn v0.14]

I'm noticing that, given the same feature table (training data) and feature vector, an SVC gives me different results in the predict_proba output.
Is this expected behavior for an SVC or should I be getting consistent results?
Thanks for your help!
I think this is caused by the fact that libsvm is calibrating probabilities using cross-validation on random folds of the dataset. In recent versions of sklearn (0.14.1+), passing random_state=0 as a constructor parameter should fix the PRNG seed used internally by libsvm. If that does not fix the outcome, please feel free to open a GitHub issue with a minimalistic reproduction script.
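For instance, a minimal sketch with synthetic data (note that predict_proba requires probability=True, which is what triggers the internal cross-validation):
```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)

# probability=True enables Platt scaling, which fits a calibrator via internal
# cross-validation; fixing random_state makes the fold assignment, and hence
# the predict_proba output, deterministic across runs.
clf = SVC(probability=True, random_state=0)
clf.fit(X, y)
print(clf.predict_proba(X[:5]))
```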
