Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 2 years ago.
How do I tune the parameters of a neural network, such as the number of layers, the types of layers, the width, etc.? Right now I simply guess at good parameters. This becomes very expensive and time-consuming, as I will tune a network and then find out that it didn't do any better than the previous model. Is there a better way to tune the model to get good test and validation scores?
It is largely a trial-and-error process. You have to experiment; there is no single method that always works. Try to use a GPU instead of a CPU to speed up computation, for example through Google Colab. My suggestion is to note down all the parameters that can be tuned.
e.g.:
Optimizer: Try different optimizers such as Adam, SGD, and others.
Learning rate: This is a very crucial parameter; try changing it from 0.0001 to 0.001 in steps of 0.0001.
Number of hidden layers: Try increasing the number of hidden layers.
Try Batch Normalization or Dropout, or both if required.
Use the correct loss function.
Change the batch size and the number of epochs.
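One way to make this kind of experimentation less ad hoc is a simple random search over the parameters listed above. The following is only a minimal sketch: the search space, the function names (`sample_config`, `random_search`, `toy_evaluate`) and the scoring function are all made up for illustration. In practice `toy_evaluate` would be replaced by code that trains a model with the given configuration and returns its validation score.

```python
import math
import random

def sample_config(rng):
    """Draw one hypothetical hyperparameter configuration at random."""
    return {
        "optimizer": rng.choice(["adam", "sgd", "rmsprop"]),
        # Sample the learning rate on a log scale between 1e-4 and 1e-3.
        "learning_rate": 10 ** rng.uniform(-4, -3),
        "hidden_layers": rng.randint(1, 4),  # inclusive on both ends
        "dropout": rng.choice([0.0, 0.2, 0.5]),
        "batch_size": rng.choice([16, 32, 64, 128]),
    }

def random_search(evaluate, n_trials=20, seed=0):
    """Try n_trials random configurations and keep the best-scoring one."""
    rng = random.Random(seed)
    best_config, best_score = None, -math.inf
    for _ in range(n_trials):
        config = sample_config(rng)
        score = evaluate(config)  # higher is better
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score

def toy_evaluate(config):
    # Placeholder score (an assumption, not a real model): it simply prefers
    # a learning rate near 10**-3.5 and two hidden layers.
    lr_penalty = abs(math.log10(config["learning_rate"]) + 3.5)
    depth_penalty = 0.1 * abs(config["hidden_layers"] - 2)
    return -(lr_penalty + depth_penalty)

best_config, best_score = random_search(toy_evaluate, n_trials=50)
print(best_config, best_score)
```

Logging every configuration together with its score (instead of keeping only the best one) also helps to avoid re-running settings that have already been tried.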
Hidden layers, epochs, batch size: Try different numbers.
Optimizers: Adam (often a good default), RMSprop
Dropout: 0.2 works well in most cases
As a plus, you should also try different activation functions (for example, ReLU in the hidden layers, and for the output layer sigmoid for binary classification or softmax for multiclass classification).
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 3 years ago.
I have searched this forum for similar questions, but the one I found was unanswered (Updating Tensorflow Object detection model with new images). I have managed to train my own custom model (let's call it model1). I was wondering whether I can use new images that have been processed by model1 to further train model1. Will it improve the accuracy of the model?
Accuracy depends on the number of correctly classified images, not only on the total number of training images (see https://developers.google.com/machine-learning/crash-course/classification/accuracy). If the new images are suitable for training (that is, they have correct labels), then you should consider retraining the model. Take a look at this post: https://datascience.stackexchange.com/questions/12761/should-a-model-be-re-trained-if-new-observations-are-available
You can use your current model (model1) in a number of ways:
on new images to detect bad results (hard examples) for new training
on new images to detect good results for evaluation
on the images in the existing dataset to detect bad images (wrong label etc.)
Some of the bad results from new images will be non-objects (adversarial) and not directly usable for training (but see this: https://github.com/tensorflow/models/issues/3578#issuecomment-375267920).
Removal of bad images from the existing dataset requires retraining from scratch unless there is some funky way of "untraining" images from a model.
Eventually one would end up approaching a perfect dataset that makes best use of the capacity of the chosen model architecture, although the domain may evolve over time.
I think the reason this is not much discussed is because most researchers have to work with common datasets so they can compare their approaches (brilliant read: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5697567/).
It might improve it, but it is tricky, and it can lead to overfitting. Improving the dataset would actually help, but not with images detected by the model itself. Those images are detected because the model already performs well on them, so they are not much help.
What you actually need is quite the opposite: you need to teach the model to recognize the images that it didn't recognize before.
The main problem of machine learning (which is the approach you are using for object detection here) is generalization. In your case, it is the ability to recognize objects of the same type as those in the images you used for training, in images that were not used during training.
Obviously, if you were able to use all possible images during training, your system would be perfect (actually, it would be a simple exact image matching problem). In a more realistic setup, the more training images you use, the better your chances of obtaining a good object detector.
Usually, however, it is more valuable to add hard examples to your training set. Hence, if your application allows it (in terms of computation time in particular), you can indeed add all the wrongly detected images to your dataset (with the correct labels), and it will probably help you get a better model that can detect the object under harder conditions in new images.
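As a rough illustration of picking out such hard examples, here is a minimal sketch. It is simplified to plain classification rather than full object detection, and it assumes you already have the model's class probabilities plus human-verified labels for the new images. The function name `select_hard_examples` and the threshold of 0.6 are hypothetical choices, not part of any library.

```python
import numpy as np

def select_hard_examples(probs, labels, confidence_threshold=0.6):
    """Return indices of samples the model got wrong or was unsure about.

    probs  : (n_samples, n_classes) predicted class probabilities
    labels : (n_samples,) verified ground-truth class indices
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    predicted = probs.argmax(axis=1)   # most likely class per sample
    confidence = probs.max(axis=1)     # probability of that class
    wrong = predicted != labels
    unsure = confidence < confidence_threshold
    return np.flatnonzero(wrong | unsure)

# Toy data: sample 1 is misclassified, sample 2 is correct but low-confidence.
probs = [[0.9, 0.1], [0.4, 0.6], [0.55, 0.45], [0.2, 0.8]]
labels = [0, 0, 0, 1]
hard = select_hard_examples(probs, labels)
print(hard)  # [1 2]
```

The selected images, with their corrected labels, are the ones worth adding to the training set; the confidently correct ones add little.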
However, it really depends on what you are doing. If you want to compare your system to another one, you need to use the same (training and) test images to be fair. For benchmarking, you are not allowed to include test images in the training dataset! When you compute the accuracy (on a validation/test dataset) to compare several settings, be sure you are fair in this comparison.
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Closed 2 years ago.
I am working my way through a book on machine learning right now.
While working on a NaiveBayesClassifier, the author is very much in favour of the cross-validation method.
He proposes splitting the data into ten buckets (files) and training on nine of them, each time withholding a different bucket.
Up to now, the only approach I am familiar with is to split the data into a training set and a test set in a 50%/50% ratio and simply train the classifier all at once.
Could someone explain the possible advantages of using cross-validation?
Cross-validation is a way to address the tradeoff between bias and variance.
When you fit a model to a training set, you can keep driving the training error (bias) down by adding more terms, higher-order polynomials, etc., but doing so increases the variance of the model.
But your true objective is to predict outcomes for points that your model has never seen. That's what holding out the test set simulates.
You'll create your model on a training set, then try it out on a test set. You will find that there's a minimum combination of variance and bias that will give your best results. The simplest model that minimizes both should be your choice.
I'd recommend "An Introduction to Statistical Learning" or "The Elements of Statistical Learning", both co-authored by Hastie and Tibshirani, for more details.
A general rule in machine learning is that the more data you have for training, the better the results you are likely to get. It is important to state this before I start answering the question.
Cross-validation helps us avoid overfitting the model, and it also helps to increase the generalization accuracy, which is the accuracy of the model on unseen future points. When you divide your dataset into only Dtrain and Dtest, there is one problem: if the function you end up with after training needs both the training and the test data to be determined, then you cannot claim that your accuracy on future unseen points will be the same as the accuracy you got on your test data. Take k-NN as an example: the nearest neighbours are determined from the training data, while the value of k is chosen using the test data.
But if you use cross-validation (CV), then k can be determined from the CV data, and your test data can be treated as genuinely unseen data.
Now suppose you divide your dataset into three parts: Dtrain (60%), Dcv (20%) and Dtest (20%). You now have only 60% of the data to train with. If you want to use the full 80% of your data for training, you can do this with m-fold cross-validation. In m-fold CV you divide your data into two parts, Dtrain and Dtest (let's say 80% and 20%).
Let's say the value of m is 4, so you randomly divide the training data into 4 equal parts (d1, d2, d3, d4). Start by training the model with d1, d2, d3 as the training data and d4 as the CV data, and calculate the accuracy. In the next round take d2, d3, d4 as the training data and d1 as the CV data, and so on until every part has served as the CV data once. Repeat this procedure for each candidate value of the hyperparameter you are tuning (for example k = 1, then k = 2, and so on). This way you use the entire 80% in Dtrain for training, and Dtest can be treated as a future unseen dataset.
The advantages are that you make better and fuller use of your Dtrain data, reduce overfitting, and get a more reliable estimate of the generalization accuracy.
On the downside, the time complexity is high, because the model is trained m times.
In your case the value of m is 10.
Hope this helps.
The idea is to have the maximum number of points available to train the model in order to achieve accurate results. Every data point placed in the training set is excluded from the test set. So we use the concept of k and k-1: we first divide the dataset into k equally sized bins, take one bin as the test set, and let the remaining k-1 bins form the training set. We repeat the process until every bin has been selected once as the test set, with the remaining k-1 bins as the training set. This way no data point is left out of training.
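The splitting procedure described above can be sketched in a few lines. This is only an illustration using NumPy; the function name `k_fold_indices` and the sample count of 20 are assumptions made for the example, and libraries such as scikit-learn ship ready-made equivalents.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs; each fold is the test set once."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)   # random order of sample indices
    folds = np.array_split(shuffled, k)     # k bins of (nearly) equal size
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

# Ten buckets, as in the question: train on nine, hold out a different one
# each time.
splits = list(k_fold_indices(n_samples=20, k=10))
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))
```

Averaging the score over the ten held-out buckets gives a steadier estimate than a single 50%/50% split, at the cost of training ten times.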
Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 10 months ago.
I have a problem where I am trying to create a neural network for Tic-Tac-Toe. However, for some reason, training the neural network causes it to produce nearly the same output for any given input.
I did take a look at Artificial neural networks benchmark, but my network implementation is built for neurons with the same activation function for each neuron, i.e. no constant neurons.
To make sure the problem wasn't just due to my choice of training set (1218 board states and moves generated by a genetic algorithm), I tried to train the network to reproduce XOR. The logistic activation function was used. Instead of using the derivative, I multiplied the error by output*(1-output) as some sources suggested that this was equivalent to using the derivative. I can put the Haskell source on HPaste, but it's a little embarrassing to look at. The network has 3 layers: the first layer has 2 inputs and 4 outputs, the second has 4 inputs and 1 output, and the third has 1 output. Increasing to 4 neurons in the second layer didn't help, and neither did increasing to 8 outputs in the first layer.
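For reference, the claim that multiplying by output*(1-output) is equivalent to using the derivative of the logistic function can be checked numerically. The following is a small sketch in Python with NumPy (not the Haskell code from the question); the function names are made up for the example.

```python
import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def logistic_derivative_from_output(output):
    # Derivative of the logistic function, written in terms of its output.
    return output * (1.0 - output)

x = np.linspace(-5.0, 5.0, 11)
eps = 1e-6

# Central finite difference as an independent estimate of the derivative.
numeric = (logistic(x + eps) - logistic(x - eps)) / (2 * eps)
analytic = logistic_derivative_from_output(logistic(x))

max_abs_diff = np.max(np.abs(numeric - analytic))
print(max_abs_diff)
```

The two estimates agree up to floating-point noise, so output*(1-output) is the derivative itself rather than a replacement for it.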
I then calculated errors, network output, bias updates, and the weight updates by hand based on http://hebb.mit.edu/courses/9.641/2002/lectures/lecture04.pdf to make sure there wasn't an error in those parts of the code (there wasn't, but I will probably do it again just to make sure). Because I am using batch training, I did not multiply by x in equation (4) there. I am adding the weight change, though http://www.faqs.org/faqs/ai-faq/neural-nets/part2/section-2.html suggests to subtract it instead.
The problem persisted, even in this simplified network. For example, these are the results after 500 epochs of batch training and of incremental training.
Input    |Target|Output (Batch)      |Output (Incremental)
[1.0,1.0]|[0.0] |[0.5003781562785173]|[0.5009731800870864]
[1.0,0.0]|[1.0] |[0.5003740346965251]|[0.5006347214672715]
[0.0,1.0]|[1.0] |[0.5003734471544522]|[0.500589332376345]
[0.0,0.0]|[0.0] |[0.5003674110937019]|[0.500095157458231]
Subtracting instead of adding produces the same problem, except everything is 0.99 something instead of 0.50 something. 5000 epochs produces the same result, except the batch-trained network returns exactly 0.5 for each case. (Heck, even 10,000 epochs didn't work for batch training.)
Is there anything in general that could produce this behavior?
Also, I looked at the intermediate errors for incremental training, and although the inputs of the hidden/input layers varied, the error for the output neuron was always +/-0.12. For batch training, the errors were increasing, but extremely slowly, and the errors were all extremely small (x10^-7). Different initial random weights and biases made no difference, either.
Note that this is a school project, so hints/guides would be more helpful. Although reinventing the wheel and making my own network (in a language I don't know well!) was a horrible idea, I felt it would be more appropriate for a school project (so I know what's going on...in theory, at least. There doesn't seem to be a computer science teacher at my school).
EDIT: Two layers, an input layer of 2 inputs to 8 outputs, and an output layer of 8 inputs to 1 output, produces much the same results: 0.5+/-0.2 (or so) for each training case. I'm also playing around with pyBrain, seeing if any network structure there will work.
Edit 2: I am using a learning rate of 0.1. Sorry for forgetting about that.
Edit 3: Pybrain's "trainUntilConvergence" doesn't get me a fully trained network, either, but 20000 epochs does, with 16 neurons in the hidden layer. 10000 epochs and 4 neurons, not so much, but close. So, in Haskell, with the input layer having 2 inputs & 2 outputs, hidden layer with 2 inputs and 8 outputs, and output layer with 8 inputs and 1 output...I get the same problem with 10000 epochs. And with 20000 epochs.
Edit 4: I ran the network by hand again based on the MIT PDF above, and the values match, so the code should be correct unless I am misunderstanding those equations.
Some of my source code is at http://hpaste.org/42453/neural_network__not_working; I'm working on cleaning my code somewhat and putting it in a Github (rather than a private Bitbucket) repository.
All of the relevant source code is now at https://github.com/l33tnerd/hsann.
I've had similar problems, but was able to solve them by changing the following:
Scale down the problem to a manageable size. I first tried too many inputs, with too many hidden layer units. Once I scaled down the problem, I could see whether the solution to the smaller problem was working. This also works because when the problem is scaled down, the time to compute the weights drops significantly, so I can try many different things without waiting.
Make sure you have enough hidden units. This was a major problem for me. I had about 900 inputs connecting to ~10 units in the hidden layer, which was far too few to converge quickly. But training also became very slow when I added more units. Scaling down the number of inputs helped a lot.
Change the activation function and its parameters. I was using tanh at first. I tried other functions: sigmoid, normalized sigmoid, Gaussian, etc. I also found that changing the function parameters to make the functions steeper or shallower affected how quickly the network converged.
Change learning algorithm parameters. Try different learning rates (0.01 to 0.9). Also try different momentum parameters (0.1 to 0.9), if your algorithm supports it.
Hope this helps those who find this thread on Google!
So I realise this is extremely late for the original post, but I came across this because I was having a similar problem and none of the reasons posted here cover what was wrong in my case.
I was working on a simple regression problem, but every time I trained the network it would converge to a point where it was giving me the same output (or sometimes a few different outputs) for each input. I played with the learning rate, the number of hidden layers/nodes, the optimization algorithm etc but it made no difference. Even when I looked at a ridiculously simple example, trying to predict the output (1d) of two different inputs (1d):
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

class net(nn.Module):
    # One hidden ReLU layer followed by a single linear output unit.
    def __init__(self, obs_size, hidden_size):
        super(net, self).__init__()
        self.fc = nn.Linear(obs_size, hidden_size)
        self.out = nn.Linear(hidden_size, 1)

    def forward(self, obs):
        h = F.relu(self.fc(obs))
        return self.out(h)

inputs = np.array([[0.5],[0.9]])
# NOTE: targets has shape [2], while the network output has shape [2, 1].
targets = torch.tensor([3.0, 2.0], dtype=torch.float32)
network = net(1,5)
optimizer = torch.optim.Adam(network.parameters(), lr=0.001)

for i in range(10000):
    out = network(torch.tensor(inputs, dtype=torch.float32))
    loss = F.mse_loss(out, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print("Loss: %f outputs: %f, %f"%(loss.item(), out[0].item(), out[1].item()))
but STILL it always output the average of the two targets for both inputs. It turns out the reason was that the shapes of my outputs and targets were not the same: the targets were Size[2] and the outputs were Size[2,1], so PyTorch broadcast them to Size[2,2] inside the MSE loss, which completely messes everything up. Once I changed:
targets = torch.tensor([3.0, 2.0], dtype=torch.float32)
to
targets = torch.tensor([[3.0], [2.0]], dtype=torch.float32)
It worked as it should. This was obviously done with PyTorch, but I suspect maybe other libraries broadcast variables in the same way.
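To see why the mismatched shapes pull the network towards the average, the arithmetic can be reproduced with plain NumPy, which follows the same style of broadcasting rules. This is only an illustrative sketch; the helper `mse` is written here for the example and is not the PyTorch function.

```python
import numpy as np

def mse(outputs, targets):
    # Mean of the element-wise squared differences (after broadcasting).
    return np.mean((outputs - targets) ** 2)

targets_1d = np.array([3.0, 2.0])     # shape (2,), like the original targets
perfect = np.array([[3.0], [2.0]])    # shape (2, 1), exactly right outputs
average = np.array([[2.5], [2.5]])    # shape (2, 1), both outputs = mean

# (2, 1) minus (2,) broadcasts to (2, 2): every output is compared with
# every target, not just with its own.
broadcast_shape = (perfect - targets_1d).shape

loss_perfect_broadcast = mse(perfect, targets_1d)   # 0.5
loss_average_broadcast = mse(average, targets_1d)   # 0.25
loss_perfect_fixed = mse(perfect, targets_1d.reshape(-1, 1))  # 0.0

print(broadcast_shape, loss_perfect_broadcast,
      loss_average_broadcast, loss_perfect_fixed)
```

With the broadcast loss, predicting the mean (2.5) for both inputs scores better than the correct answers, which is exactly the behaviour described above. Comparing `out.shape` with `targets.shape` before computing the loss catches this early.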
For me it was happening exactly as in your case: the output of the neural network was always the same, no matter the training, number of layers, etc.
It turned out my back-propagation algorithm had a problem. In one place I was multiplying by -1 where it wasn't required.
There could be another problem like this. The question is how to debug it.
Steps to debug:
Step 1: Write the algorithm so that it can take a variable number of hidden layers and a variable number of input and output nodes.
Step 2: Reduce the hidden layers to 0. Reduce the input to 2 nodes and the output to 1 node.
Step 3: Now train for the binary OR operation.
Step 4: If it converges correctly, go to Step 9.
Step 5: If it doesn't converge, train it on only 1 training sample.
Step 6: Print all the forward- and back-propagation variables (weights, node outputs, deltas, etc.).
Step 7: Take pen and paper and calculate all the variables manually.
Step 8: Cross-verify the values with the algorithm.
Step 9: If you don't find any problem with 0 hidden layers, increase the number of hidden layers to 1. Repeat steps 5, 6, 7 and 8.
It sounds like a lot of work, but it works very well IMHO.
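As a reference point for steps 2 and 3, here is a minimal sketch of a network with no hidden layer (2 inputs, 1 sigmoid output) trained on binary OR with plain batch gradient descent on the squared error. The learning rate, epoch count and initial weight range are arbitrary choices made for this example. If your own implementation is configured the same way and does not end up with similar outputs, the bug is likely in the forward pass or the weight update.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_or(epochs=5000, learning_rate=0.5, seed=0):
    """Train a single sigmoid neuron (no hidden layer) on binary OR."""
    rng = np.random.default_rng(seed)
    X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
    y = np.array([0.0, 1.0, 1.0, 1.0])
    w = rng.uniform(-0.5, 0.5, size=2)   # small random initial weights
    b = 0.0                              # bias, trained like a weight
    for _ in range(epochs):
        out = sigmoid(X @ w + b)
        # Gradient of 0.5 * (out - y)^2 with respect to the pre-activation.
        delta = (out - y) * out * (1.0 - out)
        w -= learning_rate * (X.T @ delta)
        b -= learning_rate * delta.sum()
    return w, b, sigmoid(X @ w + b)

w, b, outputs = train_or()
print(np.round(outputs, 3))
```

Printing `w`, `b` and `delta` for the first few epochs gives values small enough to verify with pen and paper, as in steps 6 to 8.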
I know that this is far too late for the original post, but maybe I can help someone with this, as I faced the same problem.
For me the problem was that my input data had missing values in important columns where the training/test data had none. I replaced these values with zeros and voilà, suddenly the results were plausible. So check your data; maybe it is misrepresented.
It's hard to tell without seeing a code sample, but this can happen because of the number of hidden neurons. As the number of neurons and hidden layers increases, it becomes impossible to train the network with a small set of training data. As long as the problem can be solved with fewer layers and neurons, it is a mistake to use a larger network. So perhaps your problem can be solved by paying attention to these points.
I haven't tested it with the XOR problem in the question, but for my original dataset based on Tic-Tac-Toe, I believe that I have gotten the network to train somewhat (I only ran 1000 epochs, which wasn't enough): the quickpropagation network can win/tie over half of its games; backpropagation can get about 41%. The problems came down to implementation errors (small ones) and not understanding the difference between the error derivative (which is per-weight) and the error for each neuron, which I did not pick up on in my research. @darkcanuck's answer about training the bias similarly to a weight would probably have helped, though I didn't implement it. I also rewrote my code in Python so that I could more easily hack with it. Therefore, although I haven't gotten the network to match the minimax algorithm's efficiency, I believe that I have managed to solve the problem.
I faced a similar issue earlier when my data was not properly normalized. Once I normalized the data everything ran correctly.
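By normalizing I mean something along these lines. This is just a sketch with NumPy: the function name `standardize` and the toy numbers are made up for illustration, and other scalings (for example min-max to [0, 1]) work in the same spirit.

```python
import numpy as np

def standardize(train, other):
    """Scale each column to zero mean and unit variance using the
    statistics of the training data only."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std[std == 0] = 1.0   # avoid dividing by zero for constant columns
    return (train - mean) / std, (other - mean) / std

# Two features on very different scales.
train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
test = np.array([[2.0, 200.0]])
train_scaled, test_scaled = standardize(train, test)
print(train_scaled)
print(test_scaled)
```

The test data is scaled with the training mean and standard deviation so that both sets go through the same transformation.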
Recently, I faced this issue again and after debugging, I found that there can be another reason for neural networks giving the same output. If you have a neural network that has a weight decay term such as that in the RSNNS package, make sure that your decay term is not so large that all weights go to essentially 0.
I was using the caret package in R. Initially, I was using a decay hyperparameter of 0.01. When I looked at the diagnostics, I saw that the RMSE was being calculated for each fold (of cross-validation), but the Rsquared was always NA. In this case all predictions were coming out as the same value.
Once I reduced the decay to a much lower value (1E-5 and lower), I got the expected results.
I hope this helps.
I was running into the same problem with my model when the number of layers was large. I was using a learning rate of 0.0001. When I lowered the learning rate to 0.0000001, the problem seemed to be solved. I think the algorithm gets stuck in local minima when the learning rate is too high.
It's hard to tell without seeing a code sample, but a bias bug can have that effect (e.g. forgetting to add the bias to the input), so I would take a closer look at that part of the code.
Based on your comments, I'd agree with @finnw that you have a bias problem. You should treat the bias as a constant "1" (or -1 if you prefer) input to each neuron. Each neuron will also have its own weight for the bias, so a neuron's output should be the sum of the weighted inputs, plus the bias times its weight, passed through the activation function. Bias weights are updated during training just like the other weights.
Fausett's "Fundamentals of Neural Networks" (p.300) has an XOR example using binary inputs and a network with 2 inputs, 1 hidden layer of 4 neurons and one output neuron. Weights are randomly initialized between +0.5 and -0.5. With a learning rate of 0.02 the example network converges after about 3000 epochs. You should be able to get a result in the same ballpark if you get the bias problems (and any other bugs) ironed out.
Also note that you cannot solve the XOR problem without a hidden layer in your network.
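To make the bias handling concrete, here is a minimal sketch of a single neuron where the bias is a constant input of 1 with its own trainable weight. It is written in Python with NumPy rather than Haskell, the function names are made up for the example, and the update rule assumes plain gradient descent where `delta` is the derivative of the error with respect to the neuron's weighted sum.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron_output(inputs, weights, bias_weight):
    # The bias is an extra input fixed at 1, multiplied by its own weight.
    z = np.dot(weights, inputs) + bias_weight * 1.0
    return sigmoid(z)

def update_weights(inputs, weights, bias_weight, delta, learning_rate):
    # The bias weight is updated like any other weight whose input is 1.
    new_weights = weights - learning_rate * delta * inputs
    new_bias_weight = bias_weight - learning_rate * delta * 1.0
    return new_weights, new_bias_weight

zero_input = np.array([0.0, 0.0])
weights = np.array([0.3, -0.2])

# Without a bias, an all-zero input always gives sigmoid(0) = 0.5,
# whatever the weights are.
without_bias = neuron_output(zero_input, weights, bias_weight=0.0)
with_bias = neuron_output(zero_input, weights, bias_weight=-2.0)
print(without_bias, with_bias)

new_w, new_b = update_weights(np.array([1.0, 0.0]), weights, 0.0,
                              delta=0.1, learning_rate=0.5)
print(new_w, new_b)
```

This matters for XOR in particular, because the [0, 0] input can only be pushed away from 0.5 through the bias weights.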
I encountered a similar issue and found out that it was a problem with how my weights were being generated.
I was using:
w = numpy.random.rand(layers[i], layers[i+1])
This generates random weights uniformly distributed between 0 and 1, so all of them are positive.
The problem was solved when I used randn() instead:
w = numpy.random.randn(layers[i], layers[i+1])
randn() draws from a standard normal distribution, so it also generates negative weights, which helped my outputs become more varied.
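The difference is easy to see by inspecting the generated matrices. In this sketch the layer sizes are hypothetical and chosen large only so that the statistics are visible; the symmetric uniform range of -0.5 to +0.5 is an alternative I've added for comparison, not something I tested on my network.

```python
import numpy as np

np.random.seed(0)      # fixed seed only to make the example repeatable
layers = [100, 50]     # hypothetical layer sizes
i = 0

w_uniform = np.random.rand(layers[i], layers[i + 1])    # values in [0, 1)
w_normal = np.random.randn(layers[i], layers[i + 1])    # mean 0, std 1
w_symmetric = np.random.uniform(-0.5, 0.5,
                                size=(layers[i], layers[i + 1]))

for name, w in [("rand", w_uniform), ("randn", w_normal),
                ("uniform(-0.5, 0.5)", w_symmetric)]:
    # Fraction of negative weights and the mean of each matrix.
    print(name, (w < 0).mean(), w.mean())
```

With rand() every weight pushes in the same direction, so the weighted sums of all neurons are large and positive from the start; the other two initialisations are centred on zero.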
I ran into this exact issue. I was predicting 6 rows of data with 1200+ columns using nnet.
Each column would return a different prediction but all of the rows in that column would be the same value.
I got around this by increasing the size parameter significantly. I increased it from 1-5 to 11+.
I have also heard that decreasing your decay rate can help.
I've had similar problems with machine learning algorithms, and when I looked at the code I found random generators that were not really random. If you do not use a new random seed (such as Unix time, see http://en.wikipedia.org/wiki/Unix_time), then it is possible to get the exact same results over and over again.
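A small sketch of the effect, using Python's standard library; the helper `draw_weights` and the seed value 42 are made up for the example.

```python
import random
import time

def draw_weights(seed, n=3):
    """Draw n initial weights from a generator created with the given seed."""
    rng = random.Random(seed)
    return [rng.uniform(-0.5, 0.5) for _ in range(n)]

# The same seed always gives the same "random" initial weights.
fixed_a = draw_weights(seed=42)
fixed_b = draw_weights(seed=42)
print(fixed_a == fixed_b)  # True

# Seeding from the clock gives a different starting point on each run.
time_seeded = draw_weights(seed=time.time_ns())
print(time_seeded)
```

A fixed seed is still handy while debugging, since it makes a failing run repeatable; the point is only to know which of the two you are using.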