Given two DFA's, how can I check that the language generated by the first DFA is included in the one generated by the second one? - regular-language

I have to give an algorithm that checks, given two DFA's, whether the language generated by the first one is included in the one generated by the other.
For example, suppose that the first DFA recognizes the words over {a,b}* that have length 2, while the second one recognizes the words over {a,b}* that have length 3 or less. In this example the first language is included in the second one.
One of my ideas is to minimize both DFA's and see if the states and transitions in DFA1 are also included in DFA2, but I think it is not a good solution.
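One standard way to decide this, offered here as a sketch rather than as part of the original question, is the product construction: assuming both DFA's are complete over the same alphabet, the first language is included in the second exactly when no reachable pair of states (p, q) has p accepting in DFA1 and q rejecting in DFA2. The tuple layout `(start, accepting_states, delta)` and the helper names below are assumptions made for illustration; the two automata encode the length-2 and length-3-or-less example.

```python
from collections import deque

def is_included(dfa_a, dfa_b, alphabet):
    """Return True if L(dfa_a) is a subset of L(dfa_b).

    Each DFA is a tuple (start, accepting_states, delta), where delta maps
    (state, symbol) -> state and is total over the alphabet (assumption).
    """
    start_a, accept_a, delta_a = dfa_a
    start_b, accept_b, delta_b = dfa_b
    start = (start_a, start_b)
    seen = {start}
    queue = deque([start])
    while queue:
        p, q = queue.popleft()
        # A word accepted by A but rejected by B is a counterexample.
        if p in accept_a and q not in accept_b:
            return False
        for sym in alphabet:
            nxt = (delta_a[(p, sym)], delta_b[(q, sym)])
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return True

def length_counter(max_state):
    """Transitions of a DFA whose state counts letters read, capped at max_state (a sink)."""
    return {(s, c): min(s + 1, max_state) for s in range(max_state + 1) for c in "ab"}

# Words of length exactly 2: states 0..3, where state 3 is the "too long" sink.
EXACTLY_TWO = (0, {2}, length_counter(3))
# Words of length 3 or less: states 0..4, where state 4 is the sink.
AT_MOST_THREE = (0, {0, 1, 2, 3}, length_counter(4))
```

The search visits at most |Q1| * |Q2| state pairs. Comparing the states and transitions of the minimized automata does not work in general: the minimal DFA for all of {a,b}* has a single state, while the minimal DFA for the words of length 2 has four, yet the second language is included in the first.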

Related

How to determine whether given language is regular or not(by just looking at the language)?

Is there any trick to guess whether a language is regular just by looking at it?
In order to choose a proof method, I have to have some hypothesis first. Do you know any hints or patterns that reduce the time spent on long questions?
For instance, I don't want to spend time on the pumping lemma when the language is actually regular, and I don't want to construct a DFA or grammar just to find out.
For example:
1. L={w ∈ {a,b}*/no of a in (w) < no of b in (w)}
2. L={a^nb^m/n,m>=0}
How to tell which is regular by just looking at the above examples??
In general, when looking at a language, a good rule of thumb for whether the language is regular or not is to think of a program that can read a string and answer the question "is this string in the language?"
To write such a program, do you need to store some arbitrary value in a variable or is the program's state (that is, the combination of all possible variables' values) limited to some finite fixed number of possibilities? If the language can be recognized by a program that only needs a fixed number of variables that can only have a fixed number of values, then you've got a regular language. If not, then not.
Using this, I can see that the first language is not regular, but the second language is. In the first language, I need to remember how many as I've seen, and how many bs. (Or at the very least, I need to keep track of (# of as) - (# of bs), and accept if the string ends while that count is negative). At the same time, there's no limit on the number of as, so this count could go arbitrarily large.
In the second language, I don't care what n and m are at all. So with the second language, my program would just keep track of "have I seen at least one b yet?" to make sure we don't have any a characters that occur after the first b. (So, one variable with only two values - true or false)
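As a minimal sketch of that rule of thumb, here is a recognizer for the second language whose entire state is one boolean. The function name is an assumption made for illustration.

```python
def in_a_star_b_star(word):
    """Recognize { a^n b^m : n, m >= 0 } using a single boolean of state."""
    seen_b = False
    for ch in word:
        if ch == "b":
            seen_b = True
        elif ch == "a":
            if seen_b:
                return False  # an 'a' after a 'b' is never allowed
        else:
            return False  # symbol outside the alphabet {a, b}
    return True
```

No counter appears anywhere, which is the hint that the language is regular. A similar program for the first language would need an integer that can grow without bound.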
So one way to make language 1 into a regular language is to change it to be:
1. L={w ∈ {a,b}*/no of a in (w) < no of b in (w), and no of a in (w) < 100}
Now I don't need to keep track of the number of as that I've seen once I hit 100 (since then I know automatically that the string isn't in the language), and likewise with the number of bs - once I hit 100, I can stop counting because I know that'll be enough unless the number of as is itself too large.
One common case you should watch out for with this is when someone asks you about languages where "number of as is a multiple of 13" or "w ∈ {0,1}* and w is the binary representation of a multiple of 13". With these, it might seem like you need to keep track of the whole number to make the determination, but in fact you don't - in both cases, you only need to keep a variable that can count from 0 to 12. So watch out for "multiple of"-type languages. (And the related "is odd" or "is even" or "is 1 more than a multiple of 13")
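The "multiple of 13" case can be sketched the same way. Reading the binary string left to right, appending a bit doubles the number and adds the bit, so only the remainder modulo 13 has to be kept. The function name is hypothetical, and treating the empty string as 0 is an assumption.

```python
def is_binary_multiple_of_13(bits):
    """Decide whether a binary string denotes a multiple of 13.

    The only state is `remainder`, which takes one of 13 values (0..12),
    so this is a 13-state DFA written as a loop.
    """
    remainder = 0
    for bit in bits:
        if bit not in "01":
            raise ValueError("input must consist of 0s and 1s")
        # Appending a bit maps the value n to 2*n + bit.
        remainder = (2 * remainder + int(bit)) % 13
    return remainder == 0
```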
Other mathematical properties though - for example, w ∈ {0,1}* and w is the binary representation of a perfect square - will result in non-regular languages.

Algorithm for string processing

I am looking for an algorithm for string processing. I have searched but couldn't find one that meets my requirements, so I will explain what the algorithm should do with an example.
There are two sets of words, defined as shown below:
**Main_Words**: swimming, driving, playing
**Words_in_front**: I am, I enjoy, I love, I am going to go
The program will search through a huge set of words. As soon as it finds a word that is defined in Main_Words, it will check the words in front of that word to see whether they match any entry in Words_in_front.
For example, if the program encounters the word "swimming", it has to check whether the words in front of "swimming" are one of these: I am, I enjoy, I love, I am going to go.
Are there any algorithms that can do this?
A straightforward way to do this would be to just do a linear scan through the text, always keeping track of the last N+1 words (or characters) you see, where N is the number of words (or characters) in the longest phrase contained in your words_in_front collection. When you have a "main word", you can just check whether the sequence of N words/characters before it ends with any of the prefixes you have.
This would be a bit faster if you transformed your words_in_front set into a nicer data structure, such as a hashmap (perhaps keyed by last letter in the phrase..) or a prefix/suffix tree of some sort, so you wouldn't have to do an .endsWith over every single member of the set of prefixes each time you have a matching "main word." As was stated in another answer, there is much room for optimization and a few other possible implementations, but there's a start.
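A minimal sketch of that linear scan follows, with the phrases indexed by their last word so that only plausible candidates are compared. The function name, the whitespace tokenization, and the case-insensitive matching are assumptions made for illustration; punctuation is not handled.

```python
from collections import defaultdict, deque

MAIN_WORDS = {"swimming", "driving", "playing"}
WORDS_IN_FRONT = ["I am", "I enjoy", "I love", "I am going to go"]

def find_matches(text, main_words=MAIN_WORDS, phrases=WORDS_IN_FRONT):
    """Return (phrase, main_word) pairs found in a single pass over text."""
    # Index phrases by their last word so only plausible candidates are compared.
    by_last_word = defaultdict(list)
    for phrase in phrases:
        words = tuple(phrase.lower().split())
        by_last_word[words[-1]].append(words)
    longest = max(len(p.split()) for p in phrases)

    window = deque(maxlen=longest)  # the last N words seen before the current one
    matches = []
    for word in text.lower().split():
        if word in main_words and window:
            recent = tuple(window)
            for cand in by_last_word.get(window[-1], []):
                # Compare the phrase against the tail of the window.
                if recent[-len(cand):] == cand:
                    matches.append((" ".join(cand), word))
        window.append(word)
    return matches
```

The text is read once, and memory stays bounded by the length of the longest phrase regardless of how large the input is.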
Create a map (dictionary, hash, associative array, or whatever your language calls it) keyed by the entries of Main_Words, where each key points to a linked list of the Words_in_front phrases. Whenever you encounter a word matching a key, look it up in the table and check whether any phrase in the attached list matches the words in front of it.
That's the basic idea, it can be optimized for both speed and space.
You should be able to build a regular expression along these lines:
I (am|enjoy|love|am going to go) (swimming|driving|playing)
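For illustration, the pattern above can be tried with Python's `re` module. The word boundaries (`\b`) and the case-insensitive flag are additions assumed here, not part of the suggested expression, and the function name is hypothetical.

```python
import re

# Same alternation as the suggested expression, wrapped in word boundaries.
PATTERN = re.compile(
    r"\bI (am|enjoy|love|am going to go) (swimming|driving|playing)\b",
    re.IGNORECASE,
)

def find_phrases(text):
    """Return (words_in_front, main_word) pairs, without the leading 'I'."""
    return [(m.group(1), m.group(2)) for m in PATTERN.finditer(text)]
```

Because the engine backtracks over the alternatives, "I am going to go swimming" still matches even though "am" is listed before "am going to go".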

String fuzzy lookup

This is an interview question.
Given a file consisting of names, what data structure would you use to validate whether a name is in the list? What if we say a name is valid if it differs by no more than one character from a name in the file?
I would say that it depends on the context: if you have millions of names, a contract to fulfil and a product that does it for you, then I'd say go for it and forget about writing it yourself.
However, in the context of an interview question, my suggestion would be a DAWG that contains all possible mistakes.
A long time ago I heard that spell-checkers contain a list of words with possible mistakes (instead of trying to match against a list of valid words), but I don't know how true that is.
I did work once on a problem of finding a word in a list of words (with mistakes), but it wasn't restricted to a single mistake, and not a lot of memory was available. So the words were simply stored as a list (a DAWG requires nodes and pointers which would have required too much overhead).
I would suggest storing the names from the file into a trie or DAWG (better space efficiency).
When a name arrives, start traversing the data structure. There are 4 cases:
1. Name found --> the name is valid.
2. Dead end in the data structure --> check the number of characters left in the name; if no more than 1 --> the name is valid; invalid otherwise.
3. Name ended before reaching a leaf in the structure --> check whether at least one leaf is attached to the current position (this takes O(size of the alphabet)) --> if so, the name is valid; invalid otherwise.
4. Difference encountered in the middle of the word --> continue traversing from the next character --> no more errors are allowed (cases 2 and 3 no longer apply from this point).
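A simplified sketch of the trie traversal is shown below. It reads "differs by no more than one character" as at most one substituted character at the same length, which is an assumption; unlike the dead-end and early-end variants above, it does not tolerate a missing or extra character. The class and function names are hypothetical.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # character -> TrieNode
        self.is_end = False  # True if a stored name ends at this node

def build_trie(names):
    root = TrieNode()
    for name in names:
        node = root
        for ch in name:
            node = node.children.setdefault(ch, TrieNode())
        node.is_end = True
    return root

def is_valid(root, name, max_mismatches=1):
    """True if a stored name of the same length differs in at most max_mismatches positions."""
    def walk(node, i, budget):
        if i == len(name):
            return node.is_end
        for ch, child in node.children.items():
            cost = 0 if ch == name[i] else 1
            # Follow a mismatching edge only while the error budget allows it.
            if cost <= budget and walk(child, i + 1, budget - cost):
                return True
        return False
    return walk(root, 0, max_mismatches)
```

Once the single allowed mismatch has been spent, only exact edges are followed, so the search branches over the alphabet at no more than one depth along any path.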
For the first question (exact search), you can use a hash table or a trie. Bloom filters may tell you "No" earlier with a space overhead, but can never tell you a definite "Yes."
For the second question (fuzzy search), much more advanced techniques are needed. See the blog post at http://blog.srch2.com/2012/03/fuzzy-search.html, which discusses different solutions to this problem.

A reverse inference engine (find a random X for which foo(X) is true)

I am aware that languages like Prolog allow you to write things like the following:
mortal(X) :- man(X). % All men are mortal
man(socrates). % Socrates is a man
?- mortal(socrates). % Is Socrates mortal?
yes
What I want is something like this, but backwards. Suppose I have this:
mortal(X) :- man(X).
man(socrates).
man(plato).
man(aristotle).
I then ask it to give me a random X for which mortal(X) is true (thus it should give me one of 'socrates', 'plato', or 'aristotle' according to some random seed).
My questions are:
Does this sort of reverse inference have a name?
Are there any languages or libraries that support it?
EDIT
As somebody below pointed out, you can simply ask mortal(X) and it will return all X, from which you can simply pick a random one from the list. What if, however, that list would be very large, perhaps in the billions? Obviously in that case it wouldn't do to generate every possible result before picking one.
To see how this would be a practical problem, imagine a simple grammar that generated a random sentence of the form "adjective1 noun1 adverb transitive_verb adjective2 noun2". If the lists of adjectives, nouns, verbs, etc. are very large, you can see how the combinatorial explosion is a problem. If each list had 1000 words, you'd have 1000^6 possible sentences.
Instead of Prolog's depth-first search, a randomized depth-first search strategy could easily be implemented. All that is required is to randomize the program flow at choice points, so that every time a disjunction is reached, a random branch of the search tree (= Prolog program) is selected instead of the first one.
Note, though, that this approach does not guarantee that all solutions will be equally probable. To guarantee that, you need to know in advance how many solutions each branch will generate, and weight the randomization accordingly.
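The idea can be sketched outside Prolog as well. The Python below is an illustration only, not how any Prolog system implements it: each choice point is a list of alternatives, tried in shuffled order, and the search backtracks when a complete assignment fails the check. The word lists, the simplified sentence form (no adverb), and the `no_repeated_noun` constraint are all made-up assumptions.

```python
import random

def random_solution(choice_points, is_solution, rng=random):
    """Depth-first search that tries the alternatives at each choice point in random order.

    choice_points is a list of lists of alternatives; is_solution checks a
    complete assignment. Returns one satisfying tuple, or None.
    """
    def search(partial):
        if len(partial) == len(choice_points):
            return tuple(partial) if is_solution(partial) else None
        alternatives = list(choice_points[len(partial)])
        rng.shuffle(alternatives)  # randomize the branch order at this choice point
        for alt in alternatives:
            found = search(partial + [alt])
            if found is not None:
                return found
        return None  # every branch failed, so backtrack
    return search([])

ADJECTIVES = ["quick", "lazy", "red"]
NOUNS = ["fox", "dog", "apple"]
VERBS = ["sees", "eats"]

def no_repeated_noun(words):
    # Hypothetical constraint: the two nouns in the sentence must differ.
    return words[1] != words[4]

sentence = random_solution(
    [ADJECTIVES, NOUNS, VERBS, ADJECTIVES, NOUNS],
    no_repeated_noun,
    random.Random(0),
)
```

Only the branches actually explored are materialized, so the full set of combinations is never enumerated.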
I've never used Prolog or anything similar, but judging by what Wikipedia says on the subject, asking
?- mortal(X).
should list everything for which mortal is true. After that, just pick one of the results.
So to answer your questions,
I'd go with "a query with a variable in it"
From what I can tell, Prolog itself should support it quite fine.
I don't think that you can calculate the nth solution directly, but you can calculate the first n solutions (with n randomly picked) and pick the last one. Of course this would be problematic if n=10^(big_number)...
You could also do something like
mortal(ID,X) :- man(ID,X).
man(X):- random(1,4,ID), man(ID,X).
man(1,socrates).
man(2,plato).
man(3,aristotle).
but the problem is that if not every man was mortal, for example if only 1 out of 1000000 was mortal you would have to search a lot. It would be like searching for solutions for an equation by trying random numbers till you find one.
You could develop some sort of heuristic to find a solution close to the number but that may affect (negatively) the randomness.
I suspect that there is no way to do it more efficiently: you either have to calculate the set of solutions and pick one or pick one member of the superset of all solutions till you find one solution. But don't take my word for it xd

String transformation

I came across the following article which got me interested in this particular problem.
Given two words "CAT", "FAR" determine if you can get from the first
to the second via single transformations of valid words....e.g. 1
transformation gets you from CAT to CAR changing T to R, then another
gets you from CAR to FAR changing the C to F...all are valid english
words.
Any ideas? Not really sure how to begin to be honest. If you point me in the right direction, then that will be enough. Thanks!
As noted in this answer (thanks, aix), this is a shortest-path problem, and can be efficiently solved with the A* algorithm using the Hamming distance (i.e. the number of letters by which two words differ) as a heuristic.
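A minimal sketch of that A* search is given below, assuming lowercase words of equal length and a dictionary supplied as a Python set. The function names are hypothetical. The Hamming distance never overestimates the remaining steps, because each transformation changes exactly one letter.

```python
import heapq
from string import ascii_lowercase

def hamming(a, b):
    """Number of positions at which two equal-length words differ."""
    return sum(x != y for x, y in zip(a, b))

def word_ladder(start, goal, dictionary):
    """Return a shortest list of words from start to goal, or None if unreachable."""
    if len(start) != len(goal):
        return None
    # Heap entries: (estimated total cost, steps so far, word, path).
    frontier = [(hamming(start, goal), 0, start, [start])]
    best_steps = {start: 0}
    while frontier:
        _, steps, word, path = heapq.heappop(frontier)
        if word == goal:
            return path
        if steps > best_steps.get(word, steps):
            continue  # stale entry: a shorter route to this word was found later
        for i in range(len(word)):
            for letter in ascii_lowercase:
                if letter == word[i]:
                    continue
                nxt = word[:i] + letter + word[i + 1:]
                if nxt in dictionary and steps + 1 < best_steps.get(nxt, float("inf")):
                    best_steps[nxt] = steps + 1
                    priority = steps + 1 + hamming(nxt, goal)
                    heapq.heappush(frontier, (priority, steps + 1, nxt, path + [nxt]))
    return None
```

With a dictionary containing "cat", "car" and "far", the call `word_ladder("cat", "far", ...)` returns a three-word path such as cat, car, far.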
There are 3 points to consider:
1. How many characters differ between the two given words? It is not just the character that matters but also its position in the word, so compare position by position.
2. For each transformation, determine whether the resulting word is a valid English word. Some reference list of valid words will be needed here.
3. Work out a sequence of transformations in which every intermediate word is valid.
This is going to be a trial-and-error approach, I guess. Any backtracking algorithm will be a good choice.

Resources