Why is my genetic algorithm seemingly behaving randomly? - haskell

I'm attempting to evolve optimal strategies for the Iterated Prisoner's Dilemma using a basic genetic algorithm (Stochastic Universal Sampling, 1-point crossover, Canonical GA). I've implemented this algorithm in Haskell and recently added chart output. Unfortunately the graphs produced don't fit the expected pattern for this problem so it appears I have a bug.
All graphs of fitnesses I have seen for this problem look something like this:
Other examples can be seen in On Evolving Robust Strategies for Iterated Prisoner's Dilemma, P.J. Darwen and X. Yao (1993) p6-7
However my output looks like this:
If I set mutation rate to 1 I get:
This perhaps suggests that my selection function is not quite as random as I had thought, since the graph implies a homogeneous population.
My code is in this git repository should you wish to inspect it.
Now for the question: Could any of you suggest what I might be doing wrong in my GA implementation to make the graph look like this?
For example, I would assume it is unlikely to be the fitness function: I use the same fitness function for the output as the one being maximised, so even if that function is wrong in some way the GA should still be maximising that wrong function (though I could well be mistaken here, I'm rather new to genetic algorithms).
I would just like suggestions for which functions to look at; I'm tearing my hair out trying to fix this.
EDIT: Having added some debug code to my combine function it seems that it is always being passed the same individuals (even with mutation set to 1) so presumably selection is going wrong somewhere.
EDIT: Selection was going wrong, but that wasn't causing all the problems, just homogeneity in the population.

You have a function maybeFlip, which will change an allele to its opposite with a given probability. Hence, when the mutation rate is 1, you will just keep flipping all the alleles back and forth between two opposites. This explains the zig-zag pattern seen in your graph.
Also, swap is in Data.Tuple :)
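The mutation-rate point can be checked with a minimal sketch. It is written in Python rather than Haskell, and `maybe_flip` and `mutate` here are hypothetical stand-ins, not the code from the repository; the sketch assumes binary alleles and a per-allele flip probability.

```python
import random

def maybe_flip(allele, rate, rng):
    """Flip a binary allele (0 or 1) with probability `rate`."""
    return 1 - allele if rng.random() < rate else allele

def mutate(genome, rate, rng):
    """Apply maybe_flip independently to every allele."""
    return [maybe_flip(a, rate, rng) for a in genome]

rng = random.Random(0)
genome = [1, 0, 1, 1, 0]

# With rate 1.0 every allele flips every generation, so the genome
# alternates between two fixed states instead of exploring the space.
history = [genome]
for _ in range(4):
    history.append(mutate(history[-1], 1.0, rng))
```

Because `rng.random()` is always below 1.0, every second entry of `history` is identical, which is consistent with the zig-zag in the fitness graph.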

Related

Failing to see the difference between these two lines of thought in dynamic programming

People describe the difference between tabulation and memoization in dynamic programming in many ways, but I will summarize what is normally said.
Memoization is where we add caching to a function so that recursive calls take fewer computations. It is typically used on recursive functions for a top-down solution that starts with the initial problem and then recursively calls itself on smaller problems.
Tabulation uses a table to keep track of subproblem results and works in a bottom-up manner, solving the smallest subproblems before larger ones in an iterative manner.
My question is: what's the difference? Sometimes I look at different situations and the line is super blurred. Also, when memoization is said to work in a "top down" fashion, that really just refers to the stack nature of it; it still goes down to the base case (the bottom) and then uses those results to build up to the final result, so how is that really different from tabulation going from the bottom up until it's done? Or is it situational, in that tabulation approaches don't involve recursion, and whether a dynamic programming solution uses recursion IS what differentiates the two methods? If someone knowledgeable could offer their thoughts it would be much appreciated.
You're right that they're just two implementation methods for the same computation. A recursive formulation with memoization will fill in the memo cache with the same entries that an explicit tabular formulation will put in its table.
Explicit tabular formulations are strictly less useful, however. This is because they need more information about the problem in advance. They start by enumerating all possibly useful base cases and putting those in the table. (So what's "possibly useful?" That's the rub!) Then they enumerate the new "layer" of all possible problem versions that can be solved with the base cases. Then a layer of others that can be solved with those, etc. etc. This continues until the "top level" problem turns up in a layer.
For the kinds of problems typically seen in textbooks and coding interviews, determining all useful base cases is deliberately easy. The problem parameters are 2 or 3 "dense" natural numbers, so the table of solutions can be a 2d or 3d array with all elements containing useful values. In many of these, you can prove that the current layer only depends on a few (possibly one) previous layer, so all the rest can be discarded, which saves memory.
Practical problems aren't often so nice. The parameter sets aren't small or aren't natural numbers, or - even when they are - they're sparse so that filling in all entries of an array would be a waste.
In these cases, memoization is the only reasonable choice. The top-down recursion determines the sub-problems (on down to the base cases) that need solving as they occur. Sparseness doesn't matter because the memo cache can store parameter sets as explicit keys. When the current layer doesn't need more than K previous ones, various strategies can still be applied to discard the others.
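A small sketch of the sparseness point, using a recurrence chosen purely for illustration (f(0) = 1, f(n) = f(n // 2) + f(n // 3)); the function names are hypothetical. The tabular version must fill every index up to n, while the memoized version only stores the sub-problems the recursion actually reaches.

```python
def count_tab(n):
    """Bottom-up: allocates and fills a table entry for every i in 0..n."""
    table = [0] * (n + 1)
    table[0] = 1
    for i in range(1, n + 1):
        table[i] = table[i // 2] + table[i // 3]
    return table[n]

def count_memo(n, cache):
    """Top-down: only the sub-problems actually needed end up in `cache`."""
    if n == 0:
        return 1
    if n not in cache:
        cache[n] = count_memo(n // 2, cache) + count_memo(n // 3, cache)
    return cache[n]

# Both formulations compute the same value on a small input.
small_cache = {}
same = count_tab(1000) == count_memo(1000, small_cache)

# For a large input the table would need about 10**12 entries,
# while the memo cache stays tiny because the reachable states are sparse.
big_cache = {}
big_value = count_memo(10**12, big_cache)
cache_size = len(big_cache)
```

Here the memo cache uses the parameter itself as an explicit dictionary key, which is what makes the sparse case cheap.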

Are transformer-based language models overfitting on the paraphrase identification task? What tools overcome this?

I've been working on a sentence transformation task that involves paraphrase identification as a critical step: if we are confident enough that the state of the program (a sentence repeatedly modified) has become a paraphrase of a target sentence, stop transforming. The overall goal is actually to study potential reasoning in predictive models that can generate language prior to a target sentence. The approach is just one specific way of reaching that goal. Nevertheless, I've become interested in the paraphrase identification task itself, as it's received some boost from language models recently.
The problem I run into is when I manipulate sentences from examples or datasets. For example, in this HuggingFace example, if I negate either sequence or change the subject to Bloomberg, I still get a majority "is paraphrase" prediction. I started going through many examples in the MSRPC training set and negating one sentence in a positive example or making one sentence in a negative example a paraphrase of the other, especially when doing so would be a few-word edit. I found to my surprise that various language models, like bert-base-cased-finetuned-mrpc and textattack/roberta-base-MRPC, don't change their confidences much on these sorts of changes. It's surprising as these models claim an F1 score of 0.918+. The dataset is clearly missing a focus on negative examples and small perturbative examples.
My question is, are there datasets, techniques, or models that cope well with small edits? I know that this is an extremely generic question, much more than is typically asked on StackOverflow, but my concern is in finding practical tools. If there is a theoretical technique, then it might not be suitable as I'm in the category of "available tools define your approach" rather than vice-versa. So I hope that the community would have a recommendation on this.
Short answer to the question: yes, they are overfitting. Most of the important NLP data sets are not actually well-crafted enough to test what they claim to test, and instead test the ability of the model to find subtle (and not-so-subtle) patterns in the data.
The best tool I know for creating data sets that help deal with this is Checklist. The corresponding paper, "Beyond Accuracy: Behavioral Testing of NLP models with CheckList", is very readable and goes into depth on this type of issue. They have a very relevant table, but it needs some terms defined first:
We prompt users to evaluate each capability with three different test types (when possible): Minimum Functionality tests, Invariance, and Directional Expectation tests... A Minimum Functionality test (MFT), is a collection of simple examples (and labels) to check a behavior within a capability. MFTs are similar to creating small and focused testing datasets, and are particularly useful for detecting when models use shortcuts to handle complex inputs without actually mastering the capability.
...An Invariance test (INV) is when we apply label-preserving perturbations to inputs and expect the model prediction to remain the same.
A Directional Expectation test (DIR) is similar, except that the label is expected to change in a certain way. For example, we expect that sentiment will not become more positive if we add “You are lame.” to the end of tweets directed at an airline (Figure 1C).
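To make the directional-expectation idea concrete for paraphrase identification, here is a rough sketch in plain Python. It does not use the CheckList library's API; `overlap_model`, `negate`, and `directional_test` are hypothetical helpers, and the word-overlap scorer is only a toy stand-in for a real classifier that leans on surface similarity.

```python
def overlap_model(a, b):
    """Toy stand-in for a paraphrase classifier: Jaccard word overlap in [0, 1]."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def negate(sentence):
    """Label-changing perturbation: negate the first ' is ' in the sentence."""
    return sentence.replace(" is ", " is not ", 1)

def directional_test(model, pairs, perturb, threshold=0.5):
    """Return perturbed pairs that the model still calls paraphrases.

    Each input pair is expected to be a paraphrase; after the perturbation
    the prediction should drop below `threshold`, so survivors are failures.
    """
    failures = []
    for a, b in pairs:
        changed = perturb(b)
        if model(a, b) >= threshold and model(a, changed) >= threshold:
            failures.append((a, changed))
    return failures

pairs = [("the company is based in new york",
          "the company is based in new york city")]
failures = directional_test(overlap_model, pairs, negate)
```

The toy scorer fails this test because a one-word negation barely changes word overlap, which is the same kind of shortcut behaviour described in the question. Swapping `overlap_model` for a wrapper around a real model would turn this into an actual behavioural check.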
I haven't been actively involved in NLG for long, so this answer will be a bit more anecdotal than SO's algorithms would like. To start with, in my corner of Europe the peer review requirements for any kind of NLG project are higher by several orders of magnitude than in other sciences, and likely not without reason or tensor thereof.
This makes funding a bigger challenge, so wherever you are, I wish you luck on that front. I'm not sure of how big of a deal this site is in the niche, but [Ehud Reiter's Blog][1] is where I would start looking into your tooling ideas.
Maybe even reach out to them/him personally, because I can't think of another source that has an academic background and a strong propensity for practical applications of NLG, at least based on the kind of content they've been putting out over the years.
Your background, environment/funding, and seniority level/control you have over the project will eventually compose your vector decision for you. It's just how it goes on the bleeding edge of anything. What I will add, though, is not to limit yourself to a single language or technology in this phase because of those precise reasons you've mentioned. I'd recommend the same in terms of potential open source involvement but if your profile information is accurate, that probably won't happen, no matter what you do and accomplish.
But yeah, in the grand scheme of things, your question is far from too broad, in my view. It identifies a rather unmistakable problem pattern that not all branches of science are as lackadaisical to approach as NLG-adjacent fields seem to be right now. In that regard, it's not broad enough and will need to be promulgated far and wide before community-driven tooling will give you serious options on a micro level.
As for the question potentially being too broad, I'd posit it is not broad enough, so long as we collectively remain in an "oh, I was waiting for you to start doing something about it" phase.
P.S. I'd eliminate any Rust and ECMAScript alternatives prior to looking into Python, blasphemous as this might sound to a 2021 data scientist. Blasphemy, sure, but the performance is already stacked against you. Accounting for the ridicule this would receive, you already have a data set that is a mental exercise, at least due to performance reasons.
[1]: https://ehudreiter.com/2016/12/18/nlg-vs-templates/

Cannot generalize my Genetic Algorithm to new Data

I've written a GA to model a handful of stocks (4) over a period of time (5 years). It's impressive how quickly the GA can find an optimal solution to the training data, but I am also aware that this is mainly due to its tendency to over-fit in the training phase.
However, I still thought I could take a few precautions and get some kind of prediction on a set of unseen test stocks from the same period.
One precaution I took was:
When multiple stocks can be bought on the same day the GA only buys one from the list and it chooses this one randomly. I thought this randomness might help to avoid over-fitting?
Even if over-fitting is still occurring, shouldn't it be absent in the initial generations of the GA since it hasn't had a chance to over-fit yet?
As a note, I am aware of the no-free-lunch theorem, which demonstrates (I believe) that there is no perfect set of parameters which will produce an optimal output for two different datasets. If we take this further, does the no-free-lunch theorem also prohibit generalization?
The graph below illustrates this.
-> The blue line is the GA output.
-> The red line is the training data (slightly different because of the aforementioned randomness).
-> The yellow line is the stubborn test data, which shows no generalization. In fact this is the most flattering graph I could produce.
The y-axis is profit; the x-axis is the trading strategies sorted from worst to best (left to right) according to their respective profits (on the y-axis).
Some of the best advice I've received so far (thanks seaotternerd) is to focus on the earlier generations and increase the number of training examples. The graph below has 12 training stocks rather than just 4, and shows only the first 200 generations (instead of 1,000). Again, it's the most flattering chart I could produce, this time with medium selection pressure. It certainly looks a little bit better, but not fantastic either. The red line is the test data.
The problem with over-fitting is that, within a single data-set it's pretty challenging to tell over-fitting apart from actually getting better in the general case. In many ways, this is more of an art than a science, but here are some general guidelines:
A GA will learn to do exactly what you attach fitness to. If you tell it to get really good at predicting one series of stocks, it will do that. If you keep swapping in different stocks to predict, though, you might be more successful at getting it to generalize. There are a few ways to do this. The one that has had perhaps the most promising results for reducing over-fitting is imposing spatial structure on the population and evaluating on different test cases in different cells, as in the SCALP algorithm. You could also switch out the test cases on a time basis, but I've had more mixed results with that sort of an approach.
You are correct that over-fitting should be less of a problem early on. Generally, the longer you run a GA, the more over-fitting will be possible. Typically, people tend to assume that the general rules will be learned first, before the rote memorization of over-fitting takes place. However, I don't think I've actually ever seen this studied rigorously - I could imagine a scenario where over-fitting was so much easier than finding general rules that it happens first. I have no idea how common that is, though. Stopping early will also reduce the ability of the GA to find better general solutions.
Using a larger data-set (four stocks isn't that many) will make your GA less susceptible to over-fitting.
Randomness is an interesting idea. It will definitely hurt the GA's ability to find general rules, but it should also reduce over-fitting. Without knowing more about the specifics of your algorithm, it's hard to say which would win out.
That's a really interesting thought about the no free lunch theorem. I'm not 100% sure, but I think it does apply here to some extent - better fitting some data will make your results fit other data worse, by necessity. However, as wide as the range of possible stock behaviors is, it is much narrower than the range of all possible time series in general. This is why it is possible to have optimization algorithms at all - a given problem that we are working with tends to produce data that cluster relatively closely together, relative to the entire space of possible data. So, within that set of inputs that we actually care about, it is possible to get better. There is generally an upper limit of some sort on how well you can do, and it is possible that you have hit that upper limit for your data-set. But generalization is possible to some extent, so I wouldn't give up just yet.
Bottom line: I think that varying the test cases shows the most promise (although I'm biased, because that's one of my primary areas of research), but it is also the most challenging solution, implementation-wise. So as a simpler fix you can try stopping evolution sooner or increasing your data-set.
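As a rough sketch of the two simpler ideas above (switching out training cases over time, and not trusting late generations), the loop below scores each generation on a fresh random subset of the training cases and remembers whichever champion did best on held-out cases. It is not the SCALP algorithm; `evolve`, the truncation selection, and the toy fitness and mutation functions are all assumptions made for illustration.

```python
import random

def evolve(population, train_cases, val_cases, fitness, mutate,
           generations, batch_size, rng):
    """Truncation-selection GA that rotates training cases each generation."""
    best, best_val = None, float("-inf")
    for _ in range(generations):
        # Score on a different random subset of training cases every generation.
        batch = rng.sample(train_cases, batch_size)
        ranked = sorted(population,
                        key=lambda ind: sum(fitness(ind, c) for c in batch),
                        reverse=True)
        # Track the champion by held-out performance, not training performance.
        champion = ranked[0]
        val = sum(fitness(champion, c) for c in val_cases) / len(val_cases)
        if val > best_val:
            best, best_val = champion, val
        # Keep the top half and refill with mutated copies of it.
        parents = ranked[: len(ranked) // 2]
        population = parents + [mutate(p, rng) for p in parents]
    return best, best_val

# Toy problem: an individual is a number, a case is a target value.
def toy_fitness(ind, case):
    return -abs(ind - case)

def toy_mutate(ind, rng):
    return ind + rng.gauss(0.0, 0.5)

rng = random.Random(42)
population = [rng.uniform(-10.0, 10.0) for _ in range(20)]
train_cases = [float(x) for x in range(1, 13)]
val_cases = [5.0, 6.0, 7.0, 8.0]
best, best_val = evolve(population, train_cases, val_cases,
                        toy_fitness, toy_mutate,
                        generations=50, batch_size=4, rng=rng)
```

Returning the individual with the best held-out score acts as a form of early stopping: later generations can keep over-fitting the training cases without changing the answer you keep.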

Fixing an incorrectly taken 3D head scan

The problem I am facing is the following.
I have a number of 3D head scans. Some of them were taken correctly (like the attached example), but in many it is easy to see that the scanned person's head was not exactly aligned with the machine's front, so one side of the texture (and depth map) seems to be "wider" (the exact reason is that one side was captured more from behind, which can easily be seen if you look at the ears).
Fortunately, when I go from cylindrical coordinates to Cartesian ones and render the face with XNA, the face is symmetrical.
Now the thing is that I would like the texture and depth maps of all my heads to be as nice and symmetrical as the correct one (because later I want to align them and perform PCA).
The idea I have at the moment is that I could interpolate the surfaces between all of the vertices and from those interpolations take new vertices that are equally spaced from each other.
This solution seems like a lot of work and maybe it's overkill.
Maybe there is some other way (like getting that interpolation data from DirectX/XNA, which has to calculate it at some point anyway).
I will be most thankful for helpful answers.
The correct example:
http://i55.tinypic.com/332mio2.jpg
Incorrect example:
http://i54.tinypic.com/309ujvt.jpg
It's probably possible to salvage (some of) the bad scans to some degree using some coordinate transformations, but you would have to guess the "incorrectness" of the alignment and it's probably impossible to do automatically.
But, unless the original subject is dead (or otherwise unavailable), it's probably a lot easier to redo the scans.
Making another scan is very likely to be quicker, and you won't lose quality as transforming the bad scans probably will. The nose on the incorrect sample seems to be shadowing the side of the nose, and no fancy algorithm can ever fix the missing data.

Creating a smart text generator

I'm doing this for fun (or as 4chan says "for teh lolz") and if I learn something on the way all the better. I took an AI course almost 2 years ago now and I really enjoyed it but I managed to forget everything so this is a way to refresh that.
Anyway I want to be able to generate text given a set of inputs. Basically this will read forum inputs (or maybe Twitter tweets) and then generate a comment based on the learning.
Now the simplest way would be to use a Markov Chain Text Generator, but I want something a little bit more complex than that, as the Markov chain basically only learns by word order (which word is more likely to appear after word x given the input text). I'm trying to see if there's something I can do to make it a little bit smarter.
For example I want it to do something like this:
Learn from a large selection of posts in a message board but don't weight it too much
For each post:
    Learn from the other comments in that post and weigh these inputs higher
    Generate comment and post
    See what other users' reaction to your post was. If good weigh it positively so you make more posts that are similar to the one made, and vice versa if negative.
It's the weighing and learning from mistakes part that I'm not sure how to implement. I thought about Artificial Neural Networks (mainly because I remember enjoying that chapter) but as far as I can tell that's mainly used to classify things (i.e. given a finite set of choices [x1...xn] which x is this given input) not really generate anything.
I'm not even sure if this is possible or, if it is, what I should go about learning/figuring out. What algorithm is best suited for this?
To those worried that I will use this as a bot to spam or provide bad answers to SO, I promise that I will not use this to provide (bad) advice or to spam for profit. I definitely will not post its nonsensical thoughts on SO. I plan to use it for my own amusement.
Thanks!
I was thinking about something like this, too. I think using a grammatical analyzer together with a Markov Chain Generator could be a significant improvement. Then the MC can be trained on text phrases (the verb "drive" often appears together with the object "car") and produce grammatically correct sentences.
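For reference, the Markov chain part on its own can be sketched in a few lines of Python. This covers only the word-order baseline plus a per-source weight (so comments from the current thread can count for more than the general corpus); it does not include any grammatical analysis, and `train`, `generate`, and the sample texts are hypothetical.

```python
import random
from collections import defaultdict

def train(chain, text, weight=1.0):
    """Add weighted word-to-next-word counts from `text` to `chain`."""
    words = text.split()
    for cur, nxt in zip(words, words[1:]):
        chain[cur][nxt] += weight

def generate(chain, start, max_words, rng):
    """Walk the chain from `start`, sampling successors by their weights."""
    out = [start]
    while len(out) < max_words:
        followers = chain.get(out[-1])
        if not followers:          # no known successor: stop early
            break
        choices = list(followers)
        weights = [followers[w] for w in choices]
        out.append(rng.choices(choices, weights=weights)[0])
    return " ".join(out)

chain = defaultdict(lambda: defaultdict(float))
# General board text gets a low weight, the current thread a higher one.
train(chain, "i like to drive my car to work", weight=1.0)
train(chain, "i like to ride my bike to work", weight=3.0)
comment = generate(chain, "i", 8, random.Random(1))
```

Feedback on earlier posts could be folded in the same way, by calling `train` again on a well-received generated comment with a positive weight.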
