Typical number of generations for NEAT (NeuroEvolution of Augmenting Topologies)? - python-3.x

Does anyone have an estimate of the number of generations one should search before concluding that the NEAT-algorithm is not able to reach the minima?
I am running NEAT on a very small dataset of cancer patients (~5K rows), and after 5000 generations the concordance index for survival prediction is not improving.
Does anyone have experience with how many generations one should try before deeming the algorithm unsuitable for a given problem?

There are a couple of other hyperparameters to consider before concluding that NEAT cannot produce a usable neural network for your problem. Make sure your population is large enough. A larger dataset would obviously help too, but you are limited there. Finally, settings such as mutation rates, aggregation options, activation functions, and your fitness function all affect how each genome trains; a small configuration sketch follows below. Feel free to PM me if you want suggestions on them.
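For concreteness, here is a minimal sketch of how those settings come together with the neat-python package. The config file path, the score_genome placeholder, and the 300-generation cap are assumptions for illustration only; population size, mutation rates and activation options all live in the config file.

    import neat  # the neat-python package

    # Population size, mutation rates, activation/aggregation options, etc.
    # are all set in the config file ("neat_config.ini" is a hypothetical path).
    config = neat.Config(
        neat.DefaultGenome,
        neat.DefaultReproduction,
        neat.DefaultSpeciesSet,
        neat.DefaultStagnation,
        "neat_config.ini",
    )

    def score_genome(net):
        # Placeholder fitness: replace with e.g. the concordance index on your data.
        return 0.0

    def eval_genomes(genomes, config):
        for genome_id, genome in genomes:
            net = neat.nn.FeedForwardNetwork.create(genome, config)
            genome.fitness = score_genome(net)

    population = neat.Population(config)
    population.add_reporter(neat.StdOutReporter(True))   # per-generation progress
    population.add_reporter(neat.StatisticsReporter())   # fitness statistics over time
    winner = population.run(eval_genomes, 300)           # cap the number of generations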

Related

Word2Vec clustering: embed with low dimensionality or with high dimensionality and then reduce?

I am using K-means for topic modelling with Word2Vec and would like to understand the implications of embedding directly into, let's say, 10 dimensions, versus embedding into 200 dimensions and then using PCA to get down to 10. Does the second approach make sense at all?
Which one worked better for your specific purposes, & your specific data, after trying both & comparing the end-results against each other, either in some ad-hoc ("eyeballing") or rigorous way?
There's no reason to prematurely reject any approach, given how many details about your data & ultimate end-goals are unstated.
It would be atypical to train a word2vec model to have only 10 dimensions. Published work most often shows the use of 100 to 1000 dimensions, often 300 or 400, assuming you've got enough bulk training data to make the algorithm worthwhile.
(Word2vec needs a lot of varied training text, with many contrasting usage examples for every word of interest, to generate good results. You may occasionally see toy-sized demos, on smaller amounts of data, just to quickly show steps, or some major qualities of the results. But good results, in the aspects for which word2vec is most appreciated, depend on plentiful training data.)
Also, whether or not your aims would be helped by the extra step of PCA to reduce the dimensionality of a larger word2vec model seems another separable question, to be determined experimentally by comparing results with and without that step, on your actual data/problem, rather than guessed at from intuitions from other projects that might not be comparable.
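If you do want to try both, a rough sketch of the comparison with gensim and scikit-learn might look like the following. It assumes gensim 4.x (where the dimensionality parameter is vector_size), a hypothetical load_tokenized_corpus() helper standing in for your own preprocessing, and an arbitrary cluster count of 20.

    from gensim.models import Word2Vec
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA

    # Hypothetical loader: returns a list of token lists, one per document.
    sentences = load_tokenized_corpus()

    # Approach A: train word2vec directly at 10 dimensions.
    w2v_small = Word2Vec(sentences, vector_size=10, min_count=5, workers=4)
    vectors_a = w2v_small.wv.vectors

    # Approach B: train at 200 dimensions, then reduce to 10 with PCA.
    w2v_large = Word2Vec(sentences, vector_size=200, min_count=5, workers=4)
    vectors_b = PCA(n_components=10).fit_transform(w2v_large.wv.vectors)

    # Cluster both and compare the resulting topics, by eyeballing or by a metric.
    labels_a = KMeans(n_clusters=20, n_init=10).fit_predict(vectors_a)
    labels_b = KMeans(n_clusters=20, n_init=10).fit_predict(vectors_b)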

How to avoid overfitting?

I have a situation where:
My training accuracy is 93%
CV accuracy is 55%
Test accuracy is 57%
I think this is a classic case of overfitting.
As far as I know, I can use regularization.
I have also read that cross validation helps in solving the overfitting problem.
Some questions I have regarding this:
Is cross validation used only for hyperparameter tuning, or does it also have a role in solving the overfitting problem?
If cross validation solves overfitting problems, how?
Or is cross validation used only as a check to see whether the model is overfitting or not?
I think you are confused about what exactly cross validation is. I will link to OpenML's explanation of 10-fold cross validation so you get a better idea.
Over-fitting normally occurs when there is not enough data for your model to train on, so it learns patterns in the data set that are not helpful, such as putting too much focus on outlying data that would be ignored given a larger data set.
Now to your questions:
1-2. Cross-validation is one technique that helps prevent/diagnose over-fitting. By partitioning the data set into k sub-groups, or folds, you train your model on k-1 folds and use the remaining fold as unseen validation data (a short scikit-learn sketch follows after these points). This can sometimes help prevent over-fitting, but how well it works also depends on how long/how many epochs you train for. Since you said you have a relatively small data set, make sure you aren't 'over-learning' on it; implementing cross-validation will not do you much good if you are training for hundreds or thousands of epochs on a really small data set.
Cross-validation doesn't tell you by itself whether your model is over-fitting. It may give you hints that it is if your results are vastly different across several runs of the program, but it is not going to be clear cut.
The biggest problem, and you said it yourself in the comments, is that you don't have a lot of data. The best, although not always the easiest, way is to increase your data size so your model won't learn unimportant tendencies or put too much focus on the outliers.
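To make the k-fold mechanics above concrete, here is a small scikit-learn sketch; the logistic-regression model and the synthetic data are placeholders for whatever estimator and data set you are actually using.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    # Synthetic stand-in for your data; replace with your real features/labels.
    X, y = make_classification(n_samples=500, n_features=20, random_state=0)

    model = LogisticRegression(max_iter=1000)

    # 10-fold CV: train on 9 folds, validate on the held-out fold, repeat 10 times.
    scores = cross_val_score(model, X, y, cv=10)
    print(scores.mean(), scores.std())  # a large spread hints that results are unstable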
I will link to a website that is incredibly helpful in explaining the problems of over-fitting and gives a variety of ways to attempt to overcome this problem.
Let me know if I was of help!

Why does removing validation samples from Keras model improve test accuracy so much

I'm doing a programming assignment for Andrew Ng's Deep Learning course on Convolutional Models that involves training and evaluating a model using Keras. What I've observed after a little playing with various knobs is something curious: the test accuracy of the model greatly improves (from around 50% to around 90%) when I set the validation_split parameter of Model.fit to 0. This is surprising to me; I would have thought that eliminating the validation samples would lead to over-fitting of the model, which would, in turn, reduce accuracy on the test set.
Can someone please explain why this is happening?
You're right, there is more training data, but the increase is pretty negligible since I was setting the validation fraction to 0.1, so removing it only increases the training data by about 11%. However, thinking about it some more, I realized that removing the validation step doesn't have any effect on the model itself, hence no impact on test accuracy. I think I must have changed some other parameter too, though I don't remember which.
As Matias says, it means there is more training data to work with.
However, I'd also make sure that the test accuracy actually increases from 50% to 90% consistently. Run it a couple of times to make sure. There is a possibility that, because there are very few validation samples, the model simply got lucky. That's why it is important to have a lot of validation data: to make sure the model isn't just getting lucky and that there's actually a method to the madness.
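One quick way to run that check is sketched below. It assumes NumPy arrays x_train, y_train, x_test, y_test, a hypothetical build_model() helper that returns a freshly compiled Keras model with accuracy as a metric, and a recent TensorFlow version that provides tf.keras.utils.set_random_seed.

    import numpy as np
    import tensorflow as tf

    test_accuracies = []
    for seed in range(5):
        tf.keras.utils.set_random_seed(seed)        # make runs repeatable/comparable
        model = build_model()                       # hypothetical: rebuilds and compiles the model
        model.fit(x_train, y_train, epochs=20,
                  validation_split=0.1, verbose=0)  # compare against validation_split=0.0
        _, accuracy = model.evaluate(x_test, y_test, verbose=0)
        test_accuracies.append(accuracy)

    print(np.mean(test_accuracies), np.std(test_accuracies))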
I go over some of the "norms" when it comes to training and testing data in my book about stock prediction (another great way in my opinion to learn about Deep Learning). Feel free to check it out and learn more, as it's great for beginners.
Good Luck!

How to create Training data for Text classification on 4 categories

My machine learning goal is to search for potential risks (will cost more money) and opportunities (will save money) from a Project Requirements document.
My idea is to classify sentences from the data into one of these categories: Risk, Opportunity and Irrelevant (no risk, no opportunity; the default category).
I will use a multinomial Naive Bayes classifier for this with tf-idf.
Now I need to have data for my training set and test set. The way I will do this is label every sentence from requirement documents with 1 of the 3 categories. Is this a good approach?
Or should I only label sentences which are obviously a risk/opportunity/irrelevant?
Also, is the Irrelevant category a good idea?
I believe the three-class approach is a good one. This is similar to sentiment analysis, where you typically have positive, negative and neutral documents (or sentences). The neutral comprises the vast majority of the instances, so your classification problem will be unbalanced. That is not necessarily an issue, but for difficult problems like this one, a naive bayes classifier might simply classify everything in the neutral/irrelevant bucket since the prior for neutral will be quite high.
Your sampling (labeling) should be representative of reality. Don't try to create a dataset of 1000 risk, 1000 opportunity and 1000 irrelevant sentences. Instead, take a sample of, say, 10000 requirements and assign the proper label to each, even if that means having far more 'Irrelevant' than 'Risk', for instance.
Text classification models require many instances, since the search space is vast. I wonder if you have considered the fact that to get reliable results (say over 90%), you may need to manually label thousands of instances.
And even if you have thousands of training instances, your problem looks particularly difficult, unless there are some obvious keywords that trigger 'risk' or 'opportunity' that I am missing. Ask yourself: would this be easy for a human to judge? If you asked 3 judges to classify your requirements, would they all come up with the same answer? If not, then you may need tens of thousands of training documents, and the classification accuracy may still be disappointing.
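For reference, a minimal scikit-learn sketch of the multinomial Naive Bayes + tf-idf setup described in the question; the example sentences and labels below are invented purely for illustration.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Tiny invented sample; real training data would be thousands of labelled sentences.
    sentences = [
        "The supplier may charge penalties for late delivery.",
        "Bundling these modules could reduce licensing costs.",
        "The system shall support English and German.",
    ]
    labels = ["Risk", "Opportunity", "Irrelevant"]

    classifier = make_pipeline(TfidfVectorizer(), MultinomialNB())
    classifier.fit(sentences, labels)

    print(classifier.predict(["Maintenance fees may increase after year one."]))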

What is an appropriate training set size for sentiment analysis?

I'm looking to use some tweets about measles/the MMR vaccine to see how sentiment about vaccination changes over time. I plan on creating the training set from the corpus of data I currently have (unless someone has a recommendation on where I can get similar data).
I would like to classify each tweet as Pro-vaccine, Anti-vaccine, or Neither (the latter being factual tweets about outbreaks).
So the question is: how big is big enough? I want to avoid problems of overfitting (so I'll do a train/test split), but as I include more and more tweets, the number of features to be learned increases dramatically.
I was thinking 1000 tweets (333 of each). Any input is appreciated here, and if you could recommend some resources, that would be great too.
More is always better. 1000 tweets on a 3-way split seems quite ambitious, I would even consider 1000 per class for a 3-way split on tweets quite low. Label as many as you can within a feasible amount of time.
Also, it might be worth taking a cascaded approach (especially with so little data): label tweets as vaccine vs non-vaccine, and within the vaccine subset label pro vs anti.
In my experience, trying to model a catch-all "neutral" class that contains everything not explicitly "pro" or "anti" is quite difficult because there is so much noise. Especially with simpler models such as Naive Bayes, I have found the cascaded approach to work quite well.
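A rough sketch of that cascaded setup with scikit-learn; the two loader functions, the label names, and the Naive Bayes pipelines are all assumptions for illustration.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Hypothetical loaders returning (list_of_tweets, list_of_labels).
    stage1_texts, stage1_labels = load_vaccine_vs_other()   # labels: "vaccine" / "other"
    stage2_texts, stage2_labels = load_pro_vs_anti()        # labels: "pro" / "anti"

    stage1 = make_pipeline(TfidfVectorizer(), MultinomialNB()).fit(stage1_texts, stage1_labels)
    stage2 = make_pipeline(TfidfVectorizer(), MultinomialNB()).fit(stage2_texts, stage2_labels)

    def classify(tweet):
        # First decide whether the tweet is about vaccination at all,
        # then decide pro vs anti only for the vaccine-related ones.
        if stage1.predict([tweet])[0] != "vaccine":
            return "Neither"
        return stage2.predict([tweet])[0]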
