Finding the complement of a regular language

Could you please help me find the complement of the language of words ending in abab, i.e. (a|b)*abab, over the alphabet {a,b}?
I guess the complement must contain all strings that don't end in abab.
One could build a DFA for the complement of (a|b)*abab and then run the R_ij algorithm on it, but please help me understand how it works without the automaton and R_ij (because that automaton has 5 states).
OK, so words are not allowed to end in abab. There are 2^4 = 16 ways to choose four letters of a's and b's at the end; abab must be erased, so 15 combinations remain. Does that mean the complement language is (a|b)* followed by the union of all those combinations of a's and b's except abab? And does the (a|b)* at the beginning stay the same?
Please help me understand this.

Maybe I don't quite understand you, but isn't it much simpler? I.e. (a|b)*(a|bb|aab|bbab), or even (a|b)*(a|(b|(a|bb)a)b)?
P.S. Don't forget that there are words shorter than abab, and all of them should be included too, i.e. (a|b){0,3} (where {0,3} denotes between 0 and 3 repetitions).
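Here is a quick brute-force check of that expression (a sketch in Python; enumerating all short strings is my own verification idea, not part of the original question):

import re
from itertools import product

# The proposed complement: words shorter than 4 letters, or words
# ending in a, bb, aab, or bbab.
complement = re.compile(r'[ab]{0,3}|[ab]*(?:a|bb|aab|bbab)')

for n in range(9):  # every string over {a,b} of length 0..8
    for letters in product('ab', repeat=n):
        s = ''.join(letters)
        assert bool(complement.fullmatch(s)) == (not s.endswith('abab')), s
print('verified: matches exactly the strings not ending in abab')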

Related

Algorithm (or pointer to literature) sought for string processing challenge

A group of amusing students write essays exclusively by plagiarising portions of the complete works of William Shakespeare. At one end of the scale, an essay might consist exclusively of a verbatim copy of a soliloquy... at the other, one might see work so novel that - while using a common alphabet - no two adjacent characters in the essay were used adjacently by Will.
Essays need to be graded. A score of 1 is assigned to any essay which can be found (character-by-character identical) in the plain-text of the complete works. A score of 2 is assigned to any work that can be successfully constructed from no fewer than two distinct (character-by-character identical) passages in the complete works, and so on... up to the limit - for an essay with N characters - which scores N if, and only if, no two adjacent characters in the essay were also placed adjacently in the complete works.
The challenge is to implement a program which can efficiently (and accurately) score essays. While any (practicable) data-structure to represent the complete works is acceptable - the essays are presented as ASCII strings.
Having considered this teasing question for a while, I came to the conclusion that it is much harder than it sounds. The naive solution, for an essay of length N, involves 2**(N-1) traversals of the complete works - which is far too inefficient to be practical.
While, obviously, I'm interested in suggested solutions - I'd also appreciate pointers to any literature that deals with this, or any similar, problem.
CLARIFICATIONS
Perhaps some examples (ranging over much shorter strings) will help clarify the 'score' for 'essays'?
Assume Shakespeare's complete works are abridged to:
"The quick brown fox jumps over the lazy dog."
Essays scoring 1 include "own fox jump" and "The quick brow". The essay "jogging" scores 6 (despite being short) because it can't be represented in fewer than 6 segments of the complete works... It can be segmented into six strings that are all substrings of the complete works as follows: "[j][og][g][i][n][g]". N.B. Establishing scores for this short example is trivial compared to the original problem - because, in this example "complete works" - there is very little repetition.
Hopefully, this example segmentation helps clarify the 2^(N-1) segmentation hypotheses. Each of the (N-1) gaps between the N characters in the essay may either be a boundary between segments, or not... resulting in ~2^(N-1) candidate segmentations, each requiring substring searches of the complete works to test.
An (N)DFA would be a wonderful solution - if it were practical. I can see how to construct something that solves 'substring matching' this way - but not scoring. The state space for scoring, on the surface at least, seems wildly too large (for any substantial complete works of Shakespeare). I'd welcome any explanation that undermines my assumption that the (N)DFA would be too large to be practical to compute/store.
A general approach for plagiarism detection is to append the student's text to the source text separated by a character not occurring in either and then to build either a suffix tree or suffix array. This will allow you to find in linear time large substrings of the student's text which also appear in the source text.
I find it difficult to be more specific because I do not understand your explanation of the score - the method above would be good for finding the longest stretch in the student's work which is an exact quote, but I don't understand your N - is it the number of distinct sections of source text needed to construct the student's text?
If so, there may be a dynamic programming approach. At step k, we work out the least number of distinct sections of source text needed to construct first k characters of the student's text. Using a suffix array built just from the source text or otherwise, we find the longest match between the source text and characters x..k of the student's text, where x is of course as small as possible. Then the least number of sections of source text needed to construct the first k characters of student text is the least needed to construct 1..x-1 (which we have already worked out) plus 1. By running this process for k=1..the length of the student text we find the least number of sections of source text needed to reconstruct the whole of it.
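A sketch of that recurrence in Python, where a naive `in` test stands in for the suffix-array lookup and the function name is my own:

def min_sections(essay, works):
    INF = float('inf')
    # best[k] = least number of source sections needed for essay[:k]
    best = [0] + [INF] * len(essay)
    for k in range(1, len(essay) + 1):
        x = k
        while x > 0 and essay[x - 1:k] in works:
            x -= 1            # grow the match ending at k backwards
        if x == k:
            return None       # a character of the essay never occurs in the works
        best[k] = best[x] + 1 # essay[x:k] is the longest match ending at k
    return best[len(essay)]

print(min_sections("jogging", "The quick brown fox jumps over the lazy dog."))  # 6

With a suffix array, finding the longest match ending at each position becomes far cheaper than the quadratic `in` probing used here.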
(Or you could just search StackOverflow for the student's text, on the grounds that students never do anything these days except post their question on StackOverflow :-)).
I claim that repeatedly moving along the target string from left to right, using a suffix array or tree to find the longest match at any time, will find the smallest number of different strings from the source text that produce the target string. I originally found this by looking for a dynamic programming recursion but, as pointed out by Evgeny Kluev, this is actually a greedy algorithm, so let's try and prove this with a typical greedy algorithm proof.
Suppose not. Then there is a solution better than the one you get by going for the longest match every time you run off the end of the current match. Compare the two proposed solutions from left to right and look for the first time when the non-greedy solution differs from the greedy solution. If there are multiple non-greedy solutions that do better than the greedy solution I am going to demand that we consider the one that differs from the greedy solution at the last possible instant.
If the non-greedy solution is going to do better than the greedy solution, and there isn't a non-greedy solution that does better and differs later, then the non-greedy solution must find that, in return for breaking off its first match earlier than the greedy solution, it can carry on its next match for longer than the greedy solution. If it can't, it might somehow do better than the greedy solution, but not in this section, which means there is a better non-greedy solution which sticks with the greedy solution until the end of our non-greedy solution's second matching section, which is against our requirement that we want the non-greedy better solution that sticks with the greedy one as long as possible. So we have to assume that, in return for breaking off the first match early, the non-greedy solution gets to carry on its second match longer. But this doesn't work, because, when the greedy solution finally has to finish using its first match, it can jump on to the same section of matching text that the non-greedy solution is using, just entering that section later than the non-greedy solution did, but carrying on for at least as long as the non-greedy solution. So there is no non-greedy solution that does better than the greedy solution and the greedy solution is optimal.
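The claim is easy to try out; a minimal sketch in Python, again with a naive substring test standing in for the suffix array or tree:

def greedy_score(essay, works):
    score, i = 0, 0
    while i < len(essay):
        j = i + 1
        while j <= len(essay) and essay[i:j] in works:
            j += 1            # extend the current match as far as it will go
        if j == i + 1:
            raise ValueError('character %r never occurs in the works' % essay[i])
        score += 1            # essay[i:j-1] is one segment
        i = j - 1
    return score

print(greedy_score("jogging", "The quick brown fox jumps over the lazy dog."))  # 6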
Have you considered using N-Grams to solve this problem?
http://en.wikipedia.org/wiki/N-gram
First read the complete works of Shakespeare and build a trie. Then process the string left to right. We can greedily take the longest substring that matches one in the trie, because we want the minimum number of strings, so there is no factor of 2^N. This second part is dirt cheap, O(N).
The depth of the trie is limited by the available space. With a gigabyte of RAM you could reasonably expect to exhaustively cover Shakespearean English strings of length at least 5 or 6. I would require that the leaf nodes be unique (which also gives a rule for constructing the trie) and keep a pointer to their place in the actual works, so you have access to the continuation.
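A depth-limited trie along those lines, as a sketch in Python (leaf uniqueness and the continuation pointers are omitted for brevity; the names are mine):

def build_trie(works, depth):
    root = {}
    for start in range(len(works)):
        node = root
        for ch in works[start:start + depth]:   # every substring up to `depth` chars
            node = node.setdefault(ch, {})
    return root

def longest_match(trie, essay, i):
    node, length = trie, 0
    while i + length < len(essay) and essay[i + length] in node:
        node = node[essay[i + length]]
        length += 1
    return length                               # longest stored prefix of essay[i:]

trie = build_trie("The quick brown fox jumps over the lazy dog.", 6)
print(longest_match(trie, "jogging", 1))        # 2, for "og"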
This feels like a problem of partial matching a very large regular expression.
If so, it can be solved by a very large nondeterministic finite automaton, or perhaps more broadly put, as a graph representing, for every character in the works of Shakespeare, all the possible next characters.
If necessary for efficiency reasons, the NDFA is guaranteed to be convertible to a DFA - but that construction can give rise to 2^n states; maybe this is what you were alluding to?
This aspect of the complexity does not really worry me. The NDFA will have M + C states: one state for each character of the works, plus C start states (where C = 26*2 + the number of punctuation characters) connected to the M states so the algorithm can (re)start when there are 0 matched characters. The question is whether the corresponding DFA would have O(2^M) states, and if so whether it is necessary to build that DFA; theoretically it's not. Note that in this construction each state has one and only one transition to exactly one other state (the state for the next character in that work). Each start state will be connected to M/C states on average, and to M states in the worst case, meaning the NDFA has to track at most M simultaneous states. That's a large number, but not an impossibly large number for computers these days.
The score would be derived by initializing it to 1 and then incrementing it every time a non-accepting state is reached (i.e. whenever the automaton has to restart).
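That simultaneous-state simulation is compact enough to sketch directly in Python (names are mine; it assumes every essay character occurs somewhere in the works):

from collections import defaultdict

def score_by_state_set(essay, works):
    occurrences = defaultdict(list)     # start states: positions of each character
    for pos, ch in enumerate(works):
        occurrences[ch].append(pos)
    score, live = 1, set(occurrences[essay[0]])
    for ch in essay[1:]:
        # advance every live match by one character of the works
        nxt = {p + 1 for p in live if p + 1 < len(works) and works[p + 1] == ch}
        if not nxt:                     # all matches died: restart, new segment
            score += 1
            nxt = set(occurrences[ch])
        live = nxt
    return score

print(score_by_state_set("jogging", "The quick brown fox jumps over the lazy dog."))  # 6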
It's true that one of the approaches to string searching is building a DFA. In fact, for the majority of string search algorithms, a small modification on failure to match (increment a counter) and on success (keep going) can serve as a general strategy.

Shortest Common Superstring algorithm?

I'm trying to code this problem here:
but I'd like to find an algorithm that breaks the problem down into steps. I can't seem to find anything too useful online, so I've come here to ask if anyone knows of a resource describing an algorithm that solves this problem.
This is called the shortest common supersequence problem. The idea is that in order for the supersequence to be the shortest, we want to find as many shared bits of a and b as possible. We can solve the problem in two steps:
Find the longest common subsequence of a and b.
Insert the remaining bits of a and b while preserving the order of these bits.
We can solve the longest common subsequence problem using dynamic programming.
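A sketch of both steps in Python (the traceback builds the supersequence directly from the LCS table; the names are mine):

def scs(a, b):
    n, m = len(a), len(b)
    # lcs[i][j] = length of the longest common subsequence of a[:i] and b[:j]
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                lcs[i][j] = lcs[i - 1][j - 1] + 1
            else:
                lcs[i][j] = max(lcs[i - 1][j], lcs[i][j - 1])
    # walk back through the table, emitting shared characters only once
    out, i, j = [], n, m
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1]); i -= 1; j -= 1
        elif lcs[i - 1][j] >= lcs[i][j - 1]:
            out.append(a[i - 1]); i -= 1
        else:
            out.append(b[j - 1]); j -= 1
    out.extend(reversed(a[:i])); out.extend(reversed(b[:j]))
    return ''.join(reversed(out))

print(scs("abac", "cab"))   # "cabac": contains both as subsequences, length 4+3-2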
I agree with Terry Li: it is only NP-complete to find the SCS of multiple sequences. For 2 sequences (say s of length n and t of length m), my solution (it doesn't use the LCS but uses something similar) runs in O(nm) time:
1) Run a global alignment in which you disallow mismatches, don't penalize indels, and give a positive score to matches (I used +1 for matches, -10 for mismatches, and 0 for indels, but these can be adjusted). (This is O(nm).)
2) Iterate over the two output strings v and w of the global alignment. If v[i] isn't a gap, output it; otherwise, output w[i]. (This is O(n+m).)
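A sketch of those two steps fused into one traceback, in Python, using the suggested scores (+1 match, -10 mismatch, 0 indel); with these weights a mismatch is never chosen, because two free indels always score better:

def scs_by_alignment(s, t):
    MATCH, MISMATCH, INDEL = 1, -10, 0
    n, m = len(s), len(t)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = MATCH if s[i - 1] == t[j - 1] else MISMATCH
            score[i][j] = max(score[i - 1][j - 1] + diag,
                              score[i - 1][j] + INDEL,
                              score[i][j - 1] + INDEL)
    # step 2: where one aligned string has a gap, output the other's character
    out, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and s[i - 1] == t[j - 1]
                and score[i][j] == score[i - 1][j - 1] + MATCH):
            out.append(s[i - 1]); i -= 1; j -= 1    # aligned match: output once
        elif i > 0 and score[i][j] == score[i - 1][j] + INDEL:
            out.append(s[i - 1]); i -= 1            # gap in t: output s's character
        else:
            out.append(t[j - 1]); j -= 1            # gap in s: output t's character
    return ''.join(reversed(out))

print(scs_by_alignment("abac", "cab"))  # "cabac"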

How does the Failure function used in KMP algorithm work?

I've tried my best reading most of the literature on this, and still haven't understood how the failure function used in the KMP algorithm is constructed. I've been referring mostly to the http://community.topcoder.com/tc?module=Static&d1=tutorials&d2=stringSearching tutorial, which most people consider excellent. However, I still have not understood it. I'd be thankful if you could take the trouble to give me a simpler, easier-to-understand explanation of it.
The failure function actually tells us this: if you have matched X characters of the search string, what is the longest suffix of that match which is also a prefix of the search string?
You are asking how it's built; the approach is quite straightforward.
If you add a new character at the end of the string - that is, you are computing f[x] - and it matches the character at position f[x-1], then f[x] is simply f[x-1]+1.
In the other case, where it doesn't match, you try smaller and smaller suffixes and check whether they match.
For example, take the word "accadaccac", for which you are building a failure function, and say you just added the last letter, 'c'.
First you check the failure function of the previous letter: it was 4, because the suffix "acca" matches the prefix "acca". Now you add the letter 'c', and it doesn't match the letter 'd' that follows the prefix "acca".
So you backtrack to the last good suffix. You are now searching for a suffix of "acca" which is also a prefix of "accadaccac", but is shorter than "acca". The answer to that question is f[length("acca")-1], i.e. f[3] = 1, because the suffix of length 1 (just the letter 'a') is also a prefix of the search string.
And now you check whether 'c' matches the character at position 1, and voila, it matches, so now you know f[9] = f[f[8]-1]+1 = 2.
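The whole construction fits in a few lines; a sketch in Python, checked against the example above:

def failure(pattern):
    # f[i] = length of the longest proper prefix of pattern[:i+1]
    #        that is also a suffix of pattern[:i+1]
    f = [0] * len(pattern)
    for i in range(1, len(pattern)):
        k = f[i - 1]
        while k > 0 and pattern[i] != pattern[k]:
            k = f[k - 1]          # fall back to the next shorter border
        if pattern[i] == pattern[k]:
            k += 1
        f[i] = k
    return f

print(failure("accadaccac"))      # [0, 0, 0, 1, 0, 1, 2, 3, 4, 2]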
I hope this will help you. Good luck! :)
http://www.oneous.com/Tutorial-Content.php?id=24
You can use the learning resources on this website to understand the KMP algorithm and the failure function. Also try taking the code and running it by hand on an example string. However, the best way to understand how it works is to code it yourself for some variations of the basic algorithm. I suggest you start with NHAY and PERIOD on SPOJ.

A reverse inference engine (find a random X for which foo(X) is true)

I am aware that languages like Prolog allow you to write things like the following:
mortal(X) :- man(X). % All men are mortal
man(socrates). % Socrates is a man
?- mortal(socrates). % Is Socrates mortal?
yes
What I want is something like this, but backwards. Suppose I have this:
mortal(X) :- man(X).
man(socrates).
man(plato).
man(aristotle).
I then ask it to give me a random X for which mortal(X) is true (thus it should give me one of 'socrates', 'plato', or 'aristotle' according to some random seed).
My questions are:
Does this sort of reverse inference have a name?
Are there any languages or libraries that support it?
EDIT
As somebody below pointed out, you can simply ask mortal(X) and it will return all X, from which you can simply pick a random one from the list. What if, however, that list were very large, perhaps in the billions? Obviously in that case it wouldn't do to generate every possible result before picking one.
To see how this would be a practical problem, imagine a simple grammar that generated a random sentence of the form "adjective1 noun1 adverb transitive_verb adjective2 noun2". If the lists of adjectives, nouns, verbs, etc. are very large, you can see how the combinatorial explosion is a problem. If each list had 1000 words, you'd have 1000^6 possible sentences.
Instead of Prolog's depth-first search, a randomized depth-first search strategy could easily be implemented. All that is required is to randomize the program flow at choice points, so that every time a disjunction is reached, a random branch of the search tree (= the Prolog program) is selected instead of the first.
Note, though, that this approach does not guarantee that all solutions will be equally probable. To guarantee that, it is required to know in advance how many solutions will be generated under each branch, so that the randomization can be weighted accordingly.
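For illustration, a sketch of that weighted randomization in Python, over an explicit toy search tree standing in for the Prolog choice points (the counts are recomputed naively here; in practice you would precompute or memoize them):

import random

def count(node):
    if not isinstance(node, list):                  # a leaf is one solution
        return 1
    return sum(count(branch) for branch in node)

def sample(node):
    while isinstance(node, list):                   # at each choice point, weight
        weights = [count(b) for b in node]          # branches by the number of
        node = random.choices(node, weights=weights)[0]   # solutions below them
    return node

tree = [['socrates'], ['plato', 'aristotle']]       # uneven branching
print(sample(tree))   # each of the three names comes out with probability 1/3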
I've never used Prolog or anything similar, but judging by what Wikipedia says on the subject, asking
?- mortal(X).
should list everything for which mortal is true. After that, just pick one of the results.
So to answer your questions,
I'd go with "a query with a variable in it"
From what I can tell, Prolog itself should support it quite fine.
I don't think you can calculate the nth solution directly, but you can calculate the first n solutions (with n randomly picked) and take the last one. Of course this would be problematic if n = 10^(big_number)...
You could also do something like
mortal(ID,X) :- man(ID,X).
man(X):- random(1,4,ID), man(ID,X).
man(1,socrates).
man(2,plato).
man(3,aristotle).
but the problem is that if not every man were mortal - for example, if only 1 out of 1000000 were - you would have to search a lot. It would be like searching for solutions to an equation by trying random numbers until you find one.
You could develop some sort of heuristic to find a solution close to the number, but that may affect the randomness (negatively).
I suspect that there is no way to do it more efficiently: you either have to calculate the set of solutions and pick one, or keep picking members of the superset of all candidates until you find one that is a solution. But don't take my word for it xd

String transformation

I came across the following article which got me interested in this particular problem.
Given two words "CAT" and "FAR", determine if you can get from the first to the second via single transformations of valid words... e.g. one transformation gets you from CAT to CAR by changing T to R, then another gets you from CAR to FAR by changing the C to F - all of which are valid English words.
Any ideas? Not really sure how to begin to be honest. If you point me in the right direction, then that will be enough. Thanks!
As noted in this answer (thanks, aix), this is a shortest-path problem, and can be efficiently solved with the A* algorithm using the Hamming distance (i.e. the number of letters by which two words differ) as a heuristic.
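A minimal sketch of that search in Python, using plain breadth-first search (the special case of A* with a zero heuristic) and a made-up four-word dictionary:

from collections import deque

def ladder(start, goal, dictionary):
    words = {w for w in dictionary if len(w) == len(start)}
    parent, queue = {start: None}, deque([start])
    while queue:
        word = queue.popleft()
        if word == goal:                      # reconstruct the chain
            path = []
            while word is not None:
                path.append(word); word = parent[word]
            return path[::-1]
        for i in range(len(word)):            # try every single-letter change
            for c in 'abcdefghijklmnopqrstuvwxyz':
                nxt = word[:i] + c + word[i + 1:]
                if nxt in words and nxt not in parent:
                    parent[nxt] = word
                    queue.append(nxt)
    return None

print(ladder('cat', 'far', {'cat', 'car', 'fat', 'far'}))   # ['cat', 'fat', 'far']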
There are 3 points to consider:
1 How many characters differ between the two given words? It's not just which characters differ; their positions in the words matter too. So compare position by position.
2 For each transformation, determine whether the resulting word is a valid English word. Some reference list of correct words (a dictionary) will be needed here.
3 Work out a sequence of transformations such that each intermediate word is valid.
This is going to be a trial-and-error approach, I guess. Any backtracking algorithm would be a good choice.
