I understand the iterative training process of finding the optimized weights that explain the dataset.
However, before studying iterative learning, I used to apply simple code like lm(Y~x) or LinearRegression().fit(X, y) without writing any explicit training loop.
Does this simple code already contain an iterative training process?
What is the difference between a simple algorithm without learning iterations and one with them?
Thanks
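For context, the two styles I am comparing look roughly like this (a quick sketch with made-up data, not my actual code):

import numpy as np
from sklearn.linear_model import LinearRegression, SGDRegressor

# Toy data: y is roughly 2*x + 1 plus a little noise
rng = np.random.RandomState(0)
X = rng.rand(100, 1)
y = 2 * X.ravel() + 1 + 0.1 * rng.randn(100)

# The "simple" call: ordinary least squares, solved directly (no loop I write myself)
ols = LinearRegression().fit(X, y)

# An explicitly iterative alternative: stochastic gradient descent over many passes
sgd = SGDRegressor(max_iter=1000, tol=1e-3).fit(X, y)

print(ols.coef_, ols.intercept_)
print(sgd.coef_, sgd.intercept_)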
I've been working with the MLPClassifier for a while and I think I had a wrong interpretation of what the function is doing the whole time. I think I've got it right now, but I am not sure about that. So I will summarize my understanding, and it would be great if you could add your thoughts on the correct understanding.
So with the MLPClassifier we are building a neural network based on a training dataset. Setting early_stopping = True, it is possible to use a validation dataset within the training process in order to check whether the network is working on a new set as well. If early_stopping = False, no validation within the process is done. After one has finished building, we can use the fitted model in order to predict on a third dataset if we wish to.
What I was thinking before is that during the whole training process a validation dataset is taken aside anyway, with validation after every epoch.
I'm not sure if my question is understandable, but it would be great if you could help me to clear my thoughts.
The sklearn.neural_network.MLPClassifier uses (a variant of) Stochastic Gradient Descent (SGD) by default. Your question could be framed more generally as how SGD is used to optimize the parameter values in a supervised learning context. There is nothing specific to Multi-layer Perceptrons (MLP) here.
So with the MLPClassifier we are building a neural network based on a training dataset. Setting early_stopping = True, it is possible to use a validation dataset within the training process
Correct, although it should be noted that this validation set is taken away from the original training set.
in order to check whether the network is working on a new set as well.
Not quite. The point of early stopping is to track the validation score during training and stop training as soon as the validation score stops improving significantly.
If early_stopping = False, no validation within the process is done. After one has finished building, we can use the fitted model in order to predict on a third dataset if we wish to.
Correct.
What I was thinking before is that during the whole training process a validation dataset is taken aside anyway, with validation after every epoch.
As you probably know by now, this is not so. The division of the learning process into epochs is somewhat arbitrary and has nothing to do with validation.
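For illustration, the relevant knobs look roughly like this (the numbers are placeholders, not recommendations):

from sklearn.neural_network import MLPClassifier

# With early_stopping=True, a fraction of the *training* data (validation_fraction)
# is set aside internally; training stops once the validation score has not
# improved by at least tol for n_iter_no_change consecutive epochs.
clf = MLPClassifier(hidden_layer_sizes=(50,),
                    solver='sgd',            # a variant of SGD ('adam' is the default)
                    early_stopping=True,
                    validation_fraction=0.1,
                    n_iter_no_change=10,
                    max_iter=500)

# With early_stopping=False (the default), all of the training data is used for
# fitting and no internal validation is done; training runs until the loss stops
# improving or max_iter is reached.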
I am working on comparing two DL algorithms in terms of computation time. How can I calculate the exact execution time of one iteration?
Take a look at the TensorFlow Profiler. It works with TensorBoard, and you don't need to change much in your code.
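If you only want a rough number rather than a full profile, a simpler alternative is to time a single step yourself, e.g. with a small helper like this sketch (time_one_iteration is just an illustrative name; plug your own training step into the lambda):

import time

def time_one_iteration(step_fn, n_warmup=1):
    # Warm up first so one-off setup costs are not counted in the measurement.
    for _ in range(n_warmup):
        step_fn()
    start = time.perf_counter()
    step_fn()                       # the single iteration being measured
    return time.perf_counter() - start

# Example with a stand-in workload; replace the lambda with your own
# training step, e.g. lambda: sess.run(train_step, feed_dict=...)
print(time_one_iteration(lambda: sum(i * i for i in range(10**6))))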
I am pretty new to TensorFlow, and I am currently learning it through the given website: https://www.tensorflow.org/get_started/get_started
It is said in the manual that:
We've created a model, but we don't know how good it is yet. To evaluate the model on training data, we need a y placeholder to provide the desired values, and we need to write a loss function.
A loss function measures how far apart the current model is from the provided data. We'll use a standard loss model for linear regression, which sums the squares of the deltas between the current model and the provided data. linear_model - y creates a vector where each element is the corresponding example's error delta. We call tf.square to square that error. Then, we sum all the squared errors to create a single scalar that abstracts the error of all examples using tf.reduce_sum:
q1."we don't know how good it is yet.", I didn't understand this
quote as the simple model created is a simple slope equation and on
what it should train for?, as the model is a simple slope. Is it
require an perfect slope or what? why am I training that model and
for what?
q2.what is a loss function? Is loss function is used to determine the
accuracy of the model? Why is it required?
q3. I didn't understand " 'sums the squares of the deltas' between
the current model and the provided data."
q4.I didn't understood this part of code,"squared_deltas =
tf.square(linear_model - y)
This is the code:
y = tf.placeholder(tf.float32)
squared_deltas = tf.square(linear_model - y)
loss = tf.reduce_sum(squared_deltas)
print(sess.run(loss, {x:[1,2,3,4], y:[0,-1,-2,-3]}))
These may be simple questions, but I am a beginner with TensorFlow and am having a hard time understanding it.
1) So you're kind of right about "why should we train for such a simple problem", but this is just an introductory piece. With any machine learning task you need to evaluate your model to see how good it is. In this case you are just trying to train to find the coefficients of the line of best fit.
2) A loss function in any machine learning context represents the error of your model. It is usually a function of the "distance" between your calculated value and the ground-truth value. Think of it as an internal evaluation score. You want to minimise your loss, so the gradients and parameter updates are based on your loss.
3/4) Your question here is more to do with least squares regression. It's a statistical method for fitting a line of best fit to a set of points. The deltas represent the differences between your calculated values and the truth values. The aim is to minimise the sum of the squares of these deltas, and hence minimise the error and get a better line of best fit.
What you are doing in this Tensorflow example is creating a machine learning model that will learn the coefficients for the line of best fit automatically using a least squares based system.
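To make the "sum of the squares of the deltas" concrete, here is the same computation done by hand in plain Python, assuming initial values of W = 0.3 and b = -0.3 (which, if I recall correctly, are the values the tutorial starts from):

W, b = 0.3, -0.3
xs = [1, 2, 3, 4]
ys = [0, -1, -2, -3]

predictions = [W * x + b for x in xs]                   # linear_model: [0.0, 0.3, 0.6, 0.9]
deltas = [p - y for p, y in zip(predictions, ys)]       # errors:       [0.0, 1.3, 2.6, 3.9]
loss = sum(d ** 2 for d in deltas)                      # sum of squared deltas
print(loss)   # roughly 23.66 (up to floating-point rounding), the value tf.reduce_sum returns here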
Pretty much all of your questions have to do with the loss function.
The loss function is a function that determines how far your outputs are from the expected (correct) outputs.
It has two usages:
It helps the algorithm determine whether the tweaking of the weights is moving things in a good or a bad direction.
It indicates the accuracy (roughly, the number of times your system guesses the correct answer).
Here the loss function is the sum of the squared deltas, where each delta is the difference between the expected output and the actual output.
I think it's squared to magnify the errors the algorithm makes, so that larger errors are penalised much more than small ones.
I'm looking at this example from scikit-learn documentation: http://scikit-learn.org/0.18/auto_examples/model_selection/plot_nested_cross_validation_iris.html
It seems to me that cross-validation is not performed in an unbiased way here. Both GridSearchCV (supposedly the inner CV loop) and cross_val_score (supposedly the outer CV loop) are using the same data and the same folds. Therefore there is an overlap between the data the classifier was trained on and evaluated with. What am I getting wrong?
#Gael - As I cannot add a comment, I am posting this in the answer section. I am not sure what Gael means by "the first split is done inside cross_val_score, and the second split is done inside GridSearchCV (that is the whole point of the GridSearchCV object)". Are you trying to imply that the cross_val_score function passes the (k-1)-fold data (used for training in the outer loop) to the clf object? That does not appear to be the case, as I can comment out the cross_val_score function and just set nested_score[i] to a dummy variable, and still obtain the exact same clf.best_score_. This implies that GridSearchCV is evaluated separately and does use all available data, not just a subset of the training data.
In nested CV, to the best of my understanding, the idea is that the inner loop does the hyper-parameter search on a smaller subset of the training data, and the outer loop then uses these parameters to do a cross-validation. One of the reasons for using a smaller subset of training data in the inner loop is to avoid information leakage. That doesn't appear to be what is happening here. The inner loop first uses all the data to search for hyper-parameters, which are then used for cross-validation in the outer loop. Thus, the inner loop has already seen all the data, and any testing done in the outer loop will suffer from information leakage. If I am mistaken, could you please point me to the section of code which you are referring to in your answer?
I totally agree: that nested-CV procedure is wrong. cross_val_score takes the best hyperparameters computed by GridSearchCV and computes a CV score using those hyperparameters. In nested CV, you need the outer loop for assessing model performance and the inner loop for model selection, such that the portion of data used for model selection in the inner loop is not the same as the one used to assess model performance. An example would be a LOOCV outer loop for assessing performance (or 5-fold, 10-fold, or whatever you like) and 10-fold CV for model selection with grid search in the inner loop. That means that if you have N observations, you will perform model selection in the inner loop (using grid search and 10-fold CV, for example) on N-1 observations, and you will assess the model performance on the left-out observation (or on the hold-out data sample if you choose another approach).
(Note that internally you are estimating N best models in terms of hyperparameters.)
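Written out by hand, the procedure I am describing looks roughly like this sketch (iris data and an arbitrary SVC grid; slow, but it makes the structure explicit):

from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_grid = {"C": [1, 10, 100], "gamma": [0.001, 0.01, 0.1]}

outer_scores = []
for train_idx, test_idx in LeaveOneOut().split(X):
    # Inner loop: model selection on the N-1 training observations only
    inner = GridSearchCV(SVC(), param_grid, cv=10)
    inner.fit(X[train_idx], y[train_idx])
    # Outer loop: assess the selected model on the single held-out observation
    outer_scores.append(inner.score(X[test_idx], y[test_idx]))

print(sum(outer_scores) / len(outer_scores))   # performance estimate on data never used for selection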
It would be helpful to have access to a link to the code of cross_val_score and GridSearchCV.
Some references for nested CV are:
Christophe Ambroise and Geoffrey J. McLachlan. Selection bias in gene extraction on the basis of microarray gene-expression data. Proceedings of the National Academy of Sciences 99(10) (2002), 6562–6566.
Gavin C. Cawley and Nicola L. C. Talbot. On overfitting in model selection and subsequent selection bias in performance evaluation. Journal of Machine Learning Research 11 (Jul 2010), 2079–2107.
Note:
I did not find anything in the documentation of cross_val_score indicating that the hyperparameters are optimized internally using a parameter search (grid search plus cross-validation, for example) on the k-1 folds of data, with those optimized parameters then applied to the hold-out data sample. (What I am describing is different from the code in http://scikit-learn.org/dev/auto_examples/model_selection/plot_nested_cross_validation_iris.html)
They are not using the same data. Granted, the code of the example does not make this apparent, because the splits are not visible: the first split is done inside cross_val_score, and the second split is done inside GridSearchCV (that is the whole point of the GridSearchCV object). Using functions and objects rather than hand-written for loops may make things less transparent, but it:
Enables reuse
Adds many "little things" that would make the for loop tedious, such as parallel computing, support for different scoring functions, etc.
Is actually safer in terms of avoiding data leakage, because our splitting code has been audited many, many times.
If you are not convinced, take a look at the code of cross_val_score and GridSearchCV.
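In other words, the example's pattern is roughly equivalent to this sketch (arbitrary SVC grid, 4 folds in both loops):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Inner loop: GridSearchCV splits whatever data it is given into 4 folds
# and picks the best hyperparameters on that data only.
clf = GridSearchCV(SVC(), {"C": [1, 10], "gamma": [0.01, 0.1]}, cv=4)

# Outer loop: cross_val_score splits the full data first; GridSearchCV is then
# re-fitted from scratch on each outer training fold and scored on the outer
# test fold it has never seen.
nested_scores = cross_val_score(clf, X, y, cv=4)
print(nested_scores.mean())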
The example was improved recently to specify this in the comments:
http://scikit-learn.org/dev/auto_examples/model_selection/plot_nested_cross_validation_iris.html
(pull request: https://github.com/scikit-learn/scikit-learn/pull/7949)
I am using an AForge.NET ANN and training it on my training set. Because the training is single-threaded and the process can take ages, I wondered if it's possible to run multi-threaded training.
Because it is a problem to use threads while training a Resilient Backpropagation network, I thought about splitting my training set between different networks and, once every N epochs, combining the weights of all networks into one, then duplicating it to all threads (so the next epoch will start with the new weights).
I can't seem to find a method in AForge.NET that combines two (or more) networks. I'm looking for some help on how to get started with the implementation.
Combining the neural networks every N iterations won't really work well. It can be very tricky to just take the weights and combine them. In some ways, this is how the crossover operation of a Genetic Algorithm works.
Really, the only way you are going to be able to do this is to modify AForge's training to support multiple threads. Basically, you need to map the gradient calculation across the threads and then do a reduce-sum on the gradients, then use the reduced gradients to update the network.
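The map/reduce-sum idea looks roughly like this sketch (written in Python for brevity rather than C#, with a toy linear model standing in for the network; it only illustrates the data flow, not AForge's API or an actual speed-up):

import numpy as np
from concurrent.futures import ThreadPoolExecutor

def chunk_gradient(weights, X_chunk, y_chunk):
    # Summed squared-error gradient for a toy linear model; this stands in for
    # whatever gradient the backprop pass of the real network would compute.
    errors = X_chunk @ weights - y_chunk
    return X_chunk.T @ errors

def parallel_step(weights, X, y, lr=0.1, n_threads=4):
    # "Map": each worker computes the gradient on its own slice of the data.
    X_chunks = np.array_split(X, n_threads)
    y_chunks = np.array_split(y, n_threads)
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        grads = pool.map(chunk_gradient, [weights] * n_threads, X_chunks, y_chunks)
    # "Reduce-sum": add the partial gradients together, then update the weights once.
    total_grad = sum(grads) / len(y)
    return weights - lr * total_grad

# Toy usage: recover the weights of a noiseless linear target.
rng = np.random.RandomState(0)
X = rng.randn(1000, 3)
y = X @ np.array([1.0, -2.0, 0.5])
w = np.zeros(3)
for _ in range(200):
    w = parallel_step(w, X, y)
print(w)   # approaches [1.0, -2.0, 0.5]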
I've implemented this exact thing in the Encog Framework; it supports multi-threaded training (RPROP) and has a C# version. http://www.heatonresearch.com/encog.