Drawing a minimal DFA for a given regular expression - regular-language

What is a direct and easy approach to drawing a minimal DFA that accepts the same language as a given regular expression (RE)?
I know it can be done by:
Regex ---to----► NFA ---to-----► DFA ---to-----► minimized DFA
But is there a shortcut? For example, for (a+b)*ab.

Regular Expression to DFA
Although there is no algorithmic shortcut for drawing a DFA from a regular expression (RE), a shortcut technique is possible by analysis rather than by derivation, and it can save you time when drawing a minimized DFA. Of course, you can only learn the technique by practice. I'll use your example to show my approach:
(a + b)*ab
First, think about the language of the regular expression. If it's difficult to state a description of the language at the first attempt, find the smallest string that can be generated in the language, then the second smallest, and so on.
Memorize the solutions for some basic regular expressions. For example, I have written here some basic ideas for writing left-linear and right-linear grammars directly from a regular expression. You can do the same for constructing minimized DFAs.
In the RE (a + b)*ab, the smallest possible string is ab, because (a + b)* can generate the empty string (^). The second smallest string is either aab or bab. One thing we can easily notice about the language is that every string in it ends with ab (the suffix), whereas the prefix can be any string of a's and b's, including ^.
Also, if the current symbol is a, then one possibility is that the next symbol is b and the string ends. Thus, in the DFA we need a transition such that whenever a b comes after an a, the DFA moves to a final state.
Next, if a new symbol arrives in the final state, we should move to a non-final state, because a symbol after b can only occur in the middle of a string in the language, as every string in the language ends with the suffix 'ab'.
So with this knowledge at this stage we can draw an incomplete transition diagram like below:
--►(Q0)---a---►(Q1)---b----►((Qf))
At this point you need to understand that every state has a meaning. For example:
(Q0) means = Start state
(Q1) means = Last symbol was 'a', and with one more 'b' we can shift to a final state
(Qf) means = Last two symbols were 'ab'
Now think about what happens if the symbol a arrives in the final state. Just move to state Q1, because this state means the last symbol was a. (updated transition diagram)
--►(Q0)---a---►(Q1)---b----►((Qf))
                ▲-----a-------|
But suppose that instead of a, the symbol b arrives in the final state. Then we should move from the final state to a non-final state. In the present transition graph, this move should go from the final state Qf to the initial state (since we again need ab in the string for acceptance).
--►(Q0)---a---►(Q1)---b----►((Qf))
    ▲           ▲-----a-------|  |
    |----------------b-----------|
This graph is still incomplete, because there is no outgoing edge for symbol a from Q1. For symbol a in state Q1 a self-loop is required, because Q1 means the last symbol was an a.
                a-
                ||
                ▼|
--►(Q0)---a---►(Q1)---b----►((Qf))
    ▲           ▲-----a-------|  |
    |----------------b-----------|
Now I believe all possible outgoing edges from Q1 and Qf are present in the graph above. The one missing edge is an outgoing edge from Q0 for symbol b. It must be a self-loop at state Q0, because we again need the sequence ab for the string to be accepted (the shift from Q0 to Qf is possible with ab).
    b-          a-
    ||          ||
    ▼|          ▼|
--►(Q0)---a---►(Q1)---b----►((Qf))
    ▲           ▲-----a-------|  |
    |----------------b-----------|
Now the DFA is complete!
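The finished diagram can be checked mechanically. Below is a minimal sketch in Python that encodes the transitions read off the diagram as a table and compares the result against Python's own regex engine for all short strings. The names TRANSITIONS, START, FINAL and accepts are only illustrative, and (a + b)* is written as (a|b)* in Python's regex syntax.

```python
import re
from itertools import product

# Transition table read off the completed diagram for (a + b)*ab.
TRANSITIONS = {
    ("Q0", "a"): "Q1", ("Q0", "b"): "Q0",
    ("Q1", "a"): "Q1", ("Q1", "b"): "Qf",
    ("Qf", "a"): "Q1", ("Qf", "b"): "Q0",
}
START = "Q0"
FINAL = {"Qf"}


def accepts(word):
    """Run the DFA on a word over {a, b} and report whether it ends in a final state."""
    state = START
    for symbol in word:
        state = TRANSITIONS[(state, symbol)]
    return state in FINAL


def agrees_with_regex(max_len=8):
    """Compare the DFA with the regex on every string up to max_len symbols."""
    pattern = re.compile(r"(a|b)*ab")
    for n in range(max_len + 1):
        for letters in product("ab", repeat=n):
            word = "".join(letters)
            if accepts(word) != bool(pattern.fullmatch(word)):
                return False
    return True
```

Such an exhaustive comparison is not a proof of minimality, but it catches a wrong or missing edge quickly.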
Of course, the method might look difficult on the first few tries. But if you learn to draw DFAs this way, you will see an improvement in your analytical skills, and you will find this method a quick and objective way to draw a DFA.
* In the link I gave, I described some more regular expressions. I would highly encourage you to learn them and try to make DFAs for those regular expressions too.

Related

What is the difference between Atom Transition, Set Transition and Epsilon Transition in the ANTLR4 ATN?

What is the difference between Atom Transition, Set Transition and Epsilon Transition in ANTLR4 ATN? Couldn't find any definitions online.
You won't find any definition, because that's an internal detail which is of no interest for most people.
The different transition types are mostly used to indicate the condition to match when the ATN walking algorithm processes them. There are 10 of them:
Epsilon: a transition that has no condition and consumes no input. The walker just skips them.
Range: probably an older version of the Set transition and not used in ANTLR4. The only difference is that Range takes start and end values of the range to match, while Set takes an interval set.
Rule: an epsilon transition in the parser ATN to a subrule. Consumes nothing.
Predicate: an epsilon transition with an attached predicate as condition. Consumes nothing.
Atom: matches a single input symbol and consumes only one.
Action: an epsilon transition that consumes nothing but has an action attached (executed in target code when this transition is being processed).
Set: matches a set of intervals (to allow gaps) and consumes only one input symbol. Typical example is: [a-zA-Z]. Used only in the lexer.
Not set: a set transition with inverted intervals (Full Unicode minus the given set).
Wildcard: matches any single input and consumes one input symbol.
Precedence (aka Precedence Predicate): Not 100% sure about this one. Seems to be used for precedence in transformed left recursive rules and also has a predicate attached. In that sense it's a pretty special transition, compared to the others.
Here's an example for some of the transitions:
This is the ATN for the rule: LETTER: [a-zA-Z] '$';. It starts with the ATN state 1 and has a single epsilon transition to the first basic state. That one has a single outgoing Set transition to another basic state. From there an Atom transition goes out to yet another intermediate state and from there to the rule end.
For this and more visualizations, grammar debugging and ANTLR4 language support install my vscode extension for ANTLR4 (provided you use Visual Studio Code as grammar editor).

How to prove the correctness of a given grammar?

I am wondering how programming language developers validate and prove that their grammar is correct. Suppose that I created a new grammar for a new language. I can test my grammar with a unit test tool by providing different kinds of test programs. However, I can never be 100% sure that my grammar is correct. How do language developers ensure that their grammar is correct in the real world?
Let's say I created a grammar for a new language using pencil and paper. However, I made a mistake and my grammar accepts expressions that end with a +, like 2+2+. If I don't find the mistake, I will implement my language using this incorrect grammar. After implementation and unit testing, I can find the error. Is it possible to find it before starting any implementation?
Certainly, I can try my grammar with some sample inputs using pencil and paper (derivations etc.), but I may miss some corner cases. Is there a better approach, or how do real-world language developers test their grammars?
A proof is a logical argument that demonstrates the truth of a claim. There are as many ways to prove something as there are ways of thinking about a problem. A common way to prove things about discrete structures (like grammars) is using mathematical induction. Basically, you show that something is true in base cases - the simplest cases possible - and then show that if it's true for all cases under a certain size, it must therefore be true for cases of the next size.
In our case: suppose we wanted only to prove your grammar didn't generate + at the end of a word. We could do induction on the number of productions used in constructing a string in the language. We would identify all relevant base cases, show the property holds for these strings, and then show that longer strings in the language are constructed in such a way that it is impossible to get a + at the end. Here's an example.
S := S + S | (S) | x
Base case: the shortest string in the language is x, generated as S -> x. It does not end with a +.
Induction hypothesis: assume all strings produced using up to and including k productions do not end with +.
Induction step: we must show that strings produced using k+1 productions do not end with +. The first production applied is either S -> (S) or S -> S + S (S -> x only yields the base case). If it is S -> (S), the resulting string ends with ), so the property holds. If it is S -> S + S, the last symbol of the result is the last symbol of the string derived from the right-hand S, which uses at most k productions and is never empty. By the induction hypothesis, that string does not end in +, so neither does this one. There are no other productions, so no string in the language ends in +. QED
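A proof like this can be complemented (not replaced) by a brute-force check before any implementation work. The following is a small sketch in Python that enumerates the strings of the example grammar up to a few rounds of rule application and checks the property on each. The function name generate and the choice to write terminals without spaces are assumptions of this sketch.

```python
def generate(rounds):
    """Enumerate strings of S := S + S | (S) | x by applying the rules `rounds` times.

    Round 0 contains only the base case "x"; each further round wraps every
    known string in parentheses and joins every pair of known strings with "+".
    """
    words = {"x"}
    for _ in range(rounds):
        wrapped = {"(" + w + ")" for w in words}
        summed = {u + "+" + v for u in words for v in words}
        words = words | wrapped | summed
    return words


def no_trailing_plus(rounds=3):
    """Check the property from the proof on every enumerated string."""
    return not any(w.endswith("+") for w in generate(rounds))
```

A check like this only covers short strings, so it can find a counterexample such as 2+2+ early, but it cannot establish correctness for the whole language the way the induction does.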

How can I find the closure of a DFA

I am trying to implement the closure of a DFA. I have successfully implemented Union, Complement, Intersection, Subtraction and Concatenation of DFAs without using an NFA. Our teacher did not tell us the algorithm to find the closure. I tried to do it by concatenating a DFA to itself, but quite obviously it did not work.
I just need the steps. By the way, I am representing the DFA using a matrix. Can you also please elaborate on the Kleene closure? I am sure I can do that once I know how to get the closure.
I am not sure what you mean by closure in general. But for the Kleene closure you can proceed like this: from every final state you add an epsilon-transition (not reading anything) to the start state. So after reading one word of the language, you can start with another one.
Of course, the resulting automaton is not deterministic any more. But there are standard procedures for determinizing it again.
For a direct construction: look at all the transitions that enter a final state, for example (q,a) -> f. From the source state q add another transition reading the same symbol and going to the start state: (q,a) -> s. So the automaton has the possibility to finish reading the word and, just after that, start again.
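The epsilon-transition approach followed by determinization can be sketched as below, in Python. This is only an illustration under some assumptions: the input DFA is complete and given as a dictionary from (state, symbol) to state, `finals` is a set, and the names kleene_star, run and NEW are made up here. One detail the description above leaves implicit is that the Kleene closure also contains the empty word, so the sketch adds a fresh accepting start state with an epsilon-transition to the old start state.

```python
def kleene_star(delta, start, finals, alphabet):
    """Build a DFA for L* from a complete DFA for L via the subset construction."""
    NEW = "new_start"  # hypothetical fresh state; accepting, so "" is in L*

    def closure(states):
        # Epsilon moves: NEW -> start, and every final state -> start.
        result = set(states)
        if NEW in result or result & finals:
            result.add(start)
        return frozenset(result)

    start_set = closure({NEW})
    new_delta = {}
    seen = {start_set}
    todo = [start_set]
    while todo:
        current = todo.pop()
        for symbol in alphabet:
            # NEW has no symbol transitions of its own, so skip it here.
            moved = {delta[(q, symbol)] for q in current if q != NEW}
            target = closure(moved)
            new_delta[(current, symbol)] = target
            if target not in seen:
                seen.add(target)
                todo.append(target)
    new_finals = {s for s in seen if NEW in s or s & finals}
    return new_delta, start_set, new_finals


def run(delta, start, finals, word):
    """Run a DFA given as a transition dictionary."""
    state = start
    for symbol in word:
        state = delta[(state, symbol)]
    return state in finals


# Example: a DFA for the single word "ab" (state 3 is the dead state).
AB = {(0, "a"): 1, (0, "b"): 3, (1, "a"): 3, (1, "b"): 2,
      (2, "a"): 3, (2, "b"): 3, (3, "a"): 3, (3, "b"): 3}
STAR = kleene_star(AB, 0, {2}, "ab")
```

With a matrix representation, the dictionary lookups become row and column indexing, and each reachable set of states becomes one row of the new matrix.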

How to determine whether a given language is regular or not (by just looking at the language)?

Is there any trick to guess if a language is regular by just looking at the language?
In order to choose proof methods, I have to have some hypothesis at first. Do you know any hints/patterns required to reduce time consumption in solving long questions?
For instance, I don't want to spend time on the pumping lemma when the language is actually regular, and I'd rather not construct a DFA or grammar just to find out.
For example:
1. L={w ∈ {a,b}*/no of a in (w) < no of b in (w)}
2. L={a^nb^m/n,m>=0}
How to tell which is regular by just looking at the above examples??
In general, when looking at a language, a good rule of thumb for whether the language is regular or not is to think of a program that can read a string and answer the question "is this string in the language?"
To write such a program, do you need to store some arbitrary value in a variable or is the program's state (that is, the combination of all possible variables' values) limited to some finite fixed number of possibilities? If the language can be recognized by a program that only needs a fixed number of variables that can only have a fixed number of values, then you've got a regular language. If not, then not.
Using this, I can see that the first language is not regular, but the second language is. In the first language, I need to remember how many as I've seen, and how many bs. (Or at the very least, I need to keep track of (# of as) - (# of bs), and accept if the string ends while that count is negative). At the same time, there's no limit on the number of as, so this count could go arbitrarily large.
In the second language, I don't care what n and m are at all. So with the second language, my program would just keep track of "have I seen at least one b yet?" to make sure we don't have any a characters that occur after the first b. (So, one variable with only two values - true or false)
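As a rough illustration of that rule of thumb, here is what such a program could look like for the second language, written in Python. The function name is hypothetical. The only state it keeps is one Boolean, which is why a finite automaton can do the same job.

```python
def in_second_language(word):
    """Recognize a^n b^m (n, m >= 0) with a single two-valued variable."""
    seen_b = False  # the only piece of state: have we seen a 'b' yet?
    for ch in word:
        if ch == "b":
            seen_b = True
        elif ch == "a":
            if seen_b:
                return False  # an 'a' after a 'b' is not allowed
        else:
            return False  # symbol outside the alphabet {a, b}
    return True
```

A recognizer for the first language would instead need a counter with no upper bound, which is the informal sign that it is not regular.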
So one way to make language 1 into a regular language is to change it to be:
1. L={w ∈ {a,b}*/no of a in (w) < no of b in (w), and no of a in (w) < 100}
Now I don't need to keep track of the number of as that I've seen once I hit 100 (since then I know automatically that the string isn't in the language), and likewise with the number of bs - once I hit 100, I can stop counting because I know that'll be enough unless the number of as is itself too large.
One common case you should watch out for with this is when someone asks you about languages where "number of as is a multiple of 13" or "w ∈ {0,1}* and w is the binary representation of a multiple of 13". With these, it might seem like you need to keep track of the whole number to make the determination, but in fact you don't - in both cases, you only need to keep a variable that can count from 0 to 12. So watch out for "multiple of"-type languages. (And the related "is odd" or "is even" or "is 1 more than a multiple of 13")
Other mathematical properties though - for example, w ∈ {0,1}* and w is the binary representation of a perfect square - will result in non-regular languages.
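The "multiple of 13" case mentioned above can be sketched in Python as follows. The function name is made up for this example. The variable `remainder` only ever takes the values 0 to 12, so the program corresponds to a DFA with 13 states.

```python
def is_binary_multiple_of_13(bits):
    """Decide whether a string of '0'/'1' digits is the binary form of a multiple of 13.

    Reading one more bit doubles the number read so far and adds the bit,
    so only the remainder modulo 13 has to be remembered.
    """
    remainder = 0
    for bit in bits:
        remainder = (remainder * 2 + int(bit)) % 13
    return remainder == 0
```

Note that this sketch treats the empty string as representing 0 and therefore accepts it; whether that is wanted depends on how the language is defined.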

Algorithm for string replacement based on conditional char replacement

Usage case: I'm writing a domain specific language (DSL) for a regex-like but way more powerful Lispy string processing system focused on conditional replacements (like simulation of language evolution for conlangers/linguists) rather than matching as regexes do. As usual I wrote down the specs before actually writing down the code.
However, due to a somewhat stupid but hard to fix mistake, I ended up with a system only capable of doing stuff one char at a time. Thus, a rewrite rule might be (in pseudocode) change 'a' to 'e' when last char is 's' and next char is 'd'. Chars can also be deleted: delete 'a' when ....
Since the interpreter for the DSL is a bit spaghetti-ish (not in the sense of unstructured, but in the sense that 1. I haven't figured out OO for my implementation lang Chicken Scheme 2. No IDE, so must remember 20+ variable names and use emacs) I don't want to touch it, but rather "unsugar" string replacements to conditional char replacements.
The trivial example: change "ab" to "cd" unconditionally rewrites to change 'a' to 'c' when followed by 'b'; change 'b' to 'd' when preceded by 'a'. However, when there are conditions, things become very ugly very quickly. Is there some easy recursive way to do the rewriting, or is this nearly impossible in the rewriting phase and I should probably fix my DSL interpreter? (Note: my DSL has ways to get the n-th letter before and after the current char)
The problem is that since we are going through the data character-at-a-time, when a condition is applied to a multi-character string, that condition has to be expressed in different ways for every position. For instance "abc" followed by "x" combines in a straightforward way into the condition for a, b and c, but has to change shape. The x is actually three positions away from a, but only two from b. This is bad because it causes a proliferation of conditions, which all get wastefully evaluated.
I'd solve this by adding the concept of frames into the interpreter. A frame is established at the current character position, and then holds that position somehow, allowing frame-relative addressing of the characters.
I can think of a few ways of introducing this position fixing. One would be to introduce variable binding into the interpreter. It could support a pair of instructions bind symbol and unbind n, where we would be using a gensym for the symbol.
When generating the code for an operation on a string like "abc", we would generate an instruction like bind #:g0025, which would fix the position of the a, and then the compiler will analyze the conditions applied to the string, and re-phrase them in terms that are relative to #:g0025. After the processing of "abc", we would emit unbind 1 to drop the most recently bound variable.
We could also bind variables to the Boolean values of conditions.
As an example with the named frames, suppose we have
Replace "abc" with "ijk" when preceded by "x" and followed by "yz".
This goes to something like:
bind #:frame
bind #:cond0 to when #:frame[-1] is "x" and #:frame[3] is "y" and #:frame[4] is "z"
replace "a" with "i" when #:cond0 ; these insns move the char position
replace "b" with "j" when #:cond0
replace "c" with "k" when #:cond0
unbind 2
So the difficulty has been translated to one of compiling the condition into frame-relative addressing. The #:frame[3] is derived from the length of the "abc" pattern, which is available to the translator all at once. That is information not available in the target language, which doesn't have "abc" all at once.
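To make the offset calculation concrete, here is a small Python sketch of the translator side. It is not the DSL's actual instruction set: the names compile_rule and apply_rule are hypothetical, the pattern and replacement are assumed to have equal length, and the frame is simply the index of the pattern's first character.

```python
def compile_rule(pattern, replacement, before="", after=""):
    """Turn a string rule into frame-relative conditions plus per-character edits."""
    if len(pattern) != len(replacement):
        raise ValueError("this sketch assumes equal-length pattern and replacement")
    conditions = []
    # Context before the match sits at negative offsets from the frame.
    for k, ch in enumerate(before):
        conditions.append((k - len(before), ch))
    # Context after the match starts at offset len(pattern).
    for k, ch in enumerate(after):
        conditions.append((len(pattern) + k, ch))
    edits = list(zip(pattern, replacement))
    return conditions, edits


def apply_rule(text, pattern, replacement, before="", after=""):
    """Walk the text once, fixing a frame at each position and testing the rule there."""
    conditions, edits = compile_rule(pattern, replacement, before, after)
    out = []
    pos = 0
    while pos < len(text):
        frame = pos  # all offsets below are relative to this position

        def at(offset):
            i = frame + offset
            return text[i] if 0 <= i < len(text) else None

        matched = text.startswith(pattern, frame) and all(
            at(offset) == ch for offset, ch in conditions
        )
        if matched:
            out.extend(new for _old, new in edits)
            pos += len(edits)  # a match advances past the whole pattern
        else:
            out.append(text[pos])
            pos += 1
    return "".join(out)
```

For the rule above, compile_rule("abc", "ijk", "x", "yz") yields the conditions (-1, "x"), (3, "y") and (4, "z"), which are the same offsets as #:frame[-1], #:frame[3] and #:frame[4].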
The system almost certainly needs some way to try different matches at the same location. If there is no "abc" at the current position, another rule which replaces "foo" with something has to be tried at the same position. Perhaps when the conditions fail, the instruction doesn't advance the character position. So in our example above, that would work: all instructions share the same condition, so in the case of a match the position moves by three positions, otherwise it doesn't. Still, in spite of that, there may be a requirement to have multiple edits with different conditions at the same spot. The scope of my answer isn't to design the whole thing, though.
