What is the difference between Atom Transition, Set Transition and Epsilon Transition in the ANTLR4 ATN? - antlr4

What is the difference between Atom Transition, Set Transition and Epsilon Transition in ANTLR4 ATN? Couldn't find any definitions online.

You won't find any definition, because that's an internal detail which is of no interest for most people.
The different transition types are mostly used to indicate the condition to match when the ATN walking algorithm processes them. There are 10 of them:
Epsilon: a transition that has no condition and consumes no input. The walker just skips them.
Range: probably an older version of the Set transition and not used in ANTLR4. The only difference is that Range takes start and end values of the range to match, while Set takes an interval set.
Rule: an epsilon transition in the parser ATN to a subrule. Consumes nothing.
Predicate: an epsilon transition with an attached predicate as condition. Consumes nothing.
Atom: matches a single input symbol and consumes only one.
Action: an epsilon transition that consumes nothing but has an action attached (executed in target code when this transition is being processed).
Set: matches a set of intervals (to allow gaps) and consumes only one input symbol. Typical example is: [a-zA-Z]. Used only in the lexer.
Not set: a set transition with inverted intervals (Full Unicode minus the given set).
Wildcard: matches any single input and consumes one input symbol.
Precedence (aka Precedence Predicate): Not 100% sure about this one. Seems to be used for precedence in transformed left recursive rules and also has a predicate attached. In that sense it's a pretty special transition, compared to the others.
Here's an example for some of the transitions:
This is the ATN for the rule: LETTER: [a-zA-Z] '$';. It starts with the ATN state 1 and has a single epsilon transition to the first basic state. That one has a single outgoing Set transition to another basic state. From there an Atom transition goes out to yet another intermediate state and from there to the rule end.
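To make the walk described above concrete, here is a minimal sketch in Python. It is not the ANTLR4 runtime API: the Transition class, the state numbers 2 to 5, and the walk function are all made up for illustration, and only the Epsilon, Set and Atom kinds are modelled.

```python
from dataclasses import dataclass, field


@dataclass
class Transition:
    """Toy transition (hypothetical, not an ANTLR4 class)."""
    target: int
    kind: str                                       # "epsilon", "atom" or "set"
    label: str = ""                                 # the symbol of an "atom"
    intervals: list = field(default_factory=list)   # (lo, hi) pairs of a "set"

    def matches(self, symbol: str) -> bool:
        if self.kind == "atom":
            return symbol == self.label
        if self.kind == "set":
            return any(lo <= symbol <= hi for lo, hi in self.intervals)
        return False  # epsilon has no condition and consumes nothing


# Assumed state numbering for LETTER: [a-zA-Z] '$';
ATN = {
    1: [Transition(2, "epsilon")],                                  # rule start
    2: [Transition(3, "set", intervals=[("a", "z"), ("A", "Z")])],  # [a-zA-Z]
    3: [Transition(4, "atom", label="$")],                          # '$'
    4: [Transition(5, "epsilon")],                                  # to rule end
    5: [],                                                          # rule end
}
RULE_END = 5


def walk(text: str, start: int = 1) -> bool:
    """Return True if the toy ATN matches the whole text."""
    state, pos = start, 0
    while state != RULE_END:
        for t in ATN[state]:
            if t.kind == "epsilon":
                state = t.target            # skip, consume nothing
                break
            if pos < len(text) and t.matches(text[pos]):
                state, pos = t.target, pos + 1   # consume exactly one symbol
                break
        else:
            return False                    # no transition could be taken
    return pos == len(text)
```

With this sketch, walk("a$") succeeds while walk("1$") fails at the Set transition. A real ATN walker also has to handle alternatives, rule calls and predicates, which this toy leaves out.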
For this and more visualizations, grammar debugging and ANTLR4 language support install my vscode extension for ANTLR4 (provided you use Visual Studio Code as grammar editor).

Related

Regarding a state machine diagram, if two actions trigger the same transition from a state, what should the corresponding state transition table look like?

I am trying to convert this state machine diagram into its corresponding state transition table but I don't know what to write in the cell marked with question mark.
Should I write both (a2/e2) and (a3/e3)?
Your table is a state transition matrix. You need in any case to show both alternative paths in the S2->S1 cell, i.e. a2/e2 and a3/e3.
There is, as far as I know, no standard specification for such a matrix. So any of the following should do:
separating the two transition rules, e.g. with a semicolon or with an OR between them (hint: comma separation might be confusing if you're using the UML syntax, since it allows several comma-separated triggers for the same effect);
visualizing the two transition rules one above the other;
or splitting just this cell into two vertically stacked subcells.
What you cannot do is duplicate the whole line, since this would break the matrix. This technique only works with 1D state transition tables (i.e. each line corresponds to one arrow).
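As a rough sketch of the same idea in code, each matrix cell can hold a list of trigger/effect rules rather than a single one. The names matrix, add_transition and render_cell are made up for illustration, and only the S2->S1 cell from the question is filled in.

```python
from collections import defaultdict

# matrix[source][target] is a list of (trigger, effect) pairs,
# so one cell can carry several alternative transition rules.
matrix = defaultdict(lambda: defaultdict(list))


def add_transition(source, target, trigger, effect):
    matrix[source][target].append((trigger, effect))


def render_cell(source, target):
    """Render one cell, separating alternative rules with a semicolon."""
    rules = matrix[source][target]
    return "; ".join(f"{trigger}/{effect}" for trigger, effect in rules) or "-"


# The two arrows from S2 to S1 mentioned in the question.
add_transition("S2", "S1", "a2", "e2")
add_transition("S2", "S1", "a3", "e3")

print(render_cell("S2", "S1"))  # a2/e2; a3/e3
```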

One transition with multiple events in UML State diagram

We are learning in school that a transition in a behavioral state diagram has the syntax:
list of events [guard condition] / list of actions
But I couldn't find any example on the Internet where a transition with multiple events is used. Is it really possible? If yes, how does it behave? Does it mean that the transition fires when one of these events occurs (and of course the condition is fulfilled)?
Yes, a transition can be triggered by one of many events in a list. You would use such a construct to avoid multiple lines between states, making a tidier diagram.
Here is what the 2.5 spec says:
14.2.3.9.2 Enabled Transitions
A Transition is enabled if and only if:
[ . . . ]
At least one of the triggers of the Transition has an Event that is matched by the Event type of the dispatched Event occurrence.
These logically OR'ed transitions are specified textually as a comma-separated list on the transition, as specified in §14.2.4.9:
[<trigger> [‘,’ <trigger>]* [‘[‘ <guard>’]’] [‘/’ <behavior-expression>]]
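A minimal sketch of this OR semantics, assuming a toy event dispatcher: the Transition class, the dispatch function, and the state and event names below are all hypothetical and only illustrate the "at least one trigger matches" rule together with an optional guard and effect.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Transition:
    source: str
    target: str
    triggers: List[str]                               # comma-separated trigger list
    guard: Optional[Callable[[dict], bool]] = None    # [guard]
    effect: Optional[Callable[[dict], None]] = None   # / behavior-expression

    def is_enabled(self, state: str, event: str, context: dict) -> bool:
        # Enabled if the event matches at least one trigger and the guard holds.
        if state != self.source or event not in self.triggers:
            return False
        return self.guard is None or self.guard(context)


def dispatch(state, event, context, transitions):
    """Fire the first enabled transition; stay in place if none is enabled."""
    for t in transitions:
        if t.is_enabled(state, event, context):
            if t.effect is not None:
                t.effect(context)
            return t.target
    return state


def count_exit(ctx):
    ctx["exits"] = ctx.get("exits", 0) + 1


# Hypothetical transition: cancel, timeout [armed] / count_exit
TRANSITIONS = [
    Transition("Waiting", "Idle", ["cancel", "timeout"],
               guard=lambda ctx: ctx["armed"], effect=count_exit),
]
```

Either "cancel" or "timeout" moves the machine from Waiting to Idle, provided the guard is true. Any other event leaves it in Waiting.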
Unfortunately the UML spec is not specific in that respect (I thought, but Jim has the right answer). Anyway:
14.2.4.9 Transition
The default textual notation for a Transition is defined by the following BNF expression:
[<trigger> [‘,’ <trigger>]* [‘[‘ <guard>’]’] [‘/’ <behavior-expression>]]
Where <trigger> is the standard notation for Triggers (see sub clause 13.3.4), <guard> is a Boolean expression for a guard, and the optional <behavior-expression> is an expression specifying the effect Behavior written in some vendor-specific or standard textual surface language (see sub clause 16.1). The trigger may be any of the standard trigger types. SignalEvent triggers and CallEvent triggers are not distinguishable by syntax and must be discriminated by their declaration elsewhere.
There are other places in the specs where this paragraph appears in similar way, but without explaining how multiple triggers will be treated. I assume that it's an OR-condition. But that's only an assumption. Alas, since you have not seen examples (me neither) it is probably an unknown fact. Just don't use it - that's indeed possible ;-) And if you happen to find an example, just ask the author what he meant. UML is about talking to each other.

What is a Suffix Automaton?

Can someone please explain to me what exactly is a suffix automaton, and how it works and differs from suffix trees and suffix arrays? I have already tried searching on the web but was not able to come across any clear comprehensive explanation.
I found the following link closest to what I wanted but it is in Russian and the translation to English was difficult to understand.
http://e-maxx.ru/algo/suffix_automata
A suffix automaton is a finite-state machine that recognizes all the suffixes of a string. There are a lot of resources on finite-state machines you can read, Wikipedia being a good start.
Suffix trees and suffix arrays are data structures containing all the suffixes of a string. There are multiple algorithms to build and act on these structures to perform operations efficiently on strings.
Suffix automaton:
A suffix automaton (or directed acyclic word graph) is a powerful
data structure that makes it possible to solve many string problems.
For example, using a suffix automaton you can search for all
occurrences of one string in another, or count the number of
different substrings of a given string; both tasks can be solved in
linear time.
On an intuitive level, a suffix automaton can be understood as
compressed information about all the substrings of a given string. An
impressive fact is that the suffix automaton holds all this
information in such a concise form that for a string of length n it
requires only O(n) memory. Moreover, it can also be built in O(n) time
(if we consider the size k of the alphabet constant; otherwise, in
O(n log k)).
Historically, the linear size of the suffix automaton was first
established in 1983 by Blumer et al., and in 1985-1986 the first
algorithms that build it in linear time were presented (Crochemore,
Blumer et al.). For more detail see the references at the end of the
article.
In English the structure is called a "suffix automaton" (plural
"suffix automata"), and the directed acyclic graph of words is called
a "directed acyclic word graph" (or simply "DAWG").
The definition of the suffix automaton:
Definition. The suffix automaton for a given string s is the
minimal deterministic finite automaton that accepts all suffixes of
the string s.
Let us unpack this definition.
A suffix automaton is a directed acyclic graph in which the vertices
are called states and the arcs are the transitions between these
states.
One of the states, t_0, is called the initial state, and it must be
the source of the graph (i.e. all other states are reachable from it).
Each transition in the automaton is an arc labeled with some symbol.
All transitions leaving a given state must have different labels. (On
the other hand, a state may have no transition for some characters.)
One or more of the states are marked as terminal states. If we go from
the initial state t_0 along any path to any terminal state and write
out the labels of all arcs traversed, we get a string that must be
one of the suffixes of the string s.
The suffix automaton contains the minimum number of vertices among all
automata that satisfy the above conditions. (Minimality of the number
of transitions is not required separately, because an automaton with a
minimal number of states cannot contain "extra" paths; otherwise the
previous property would be violated.)
Elementary properties of the suffix automaton:
The simplest, and yet most important, property of the suffix automaton
is that it contains information about all the substrings of the string
s. Namely, for any path starting at the initial state t_0, writing out
the labels of the arcs along this path necessarily yields a substring
of s. Conversely, any substring of s corresponds to some path starting
at the initial state t_0.
To simplify the explanation, we will say that a substring corresponds
to the path from the initial state whose labels form that substring.
Conversely, we will say that any path corresponds to the string formed
by the labels of its arcs.
Each state of the suffix automaton is reached by one or more paths
from the initial state. We say that the state corresponds to the set
of strings that match all of these paths.
EXAMPLES:
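As an illustration of the definition and properties above, here is a minimal Python sketch of the usual online construction based on suffix links. The class and method names are my own choice, and the code is a teaching sketch rather than a tuned implementation.

```python
class SuffixAutomaton:
    """Suffix automaton of a string, built one character at a time."""

    def __init__(self, s: str):
        # Parallel lists per state: transitions, suffix link, longest length.
        self.next = [{}]
        self.link = [-1]
        self.length = [0]
        last = 0                      # state of the whole prefix read so far
        for ch in s:
            last = self._extend(last, ch)
        # Terminal states: follow suffix links from the state of the full string.
        self.terminal = set()
        p = last
        while p != -1:
            self.terminal.add(p)
            p = self.link[p]

    def _new_state(self, length, transitions, link):
        self.next.append(transitions)
        self.link.append(link)
        self.length.append(length)
        return len(self.next) - 1

    def _extend(self, last, ch):
        cur = self._new_state(self.length[last] + 1, {}, -1)
        p = last
        while p != -1 and ch not in self.next[p]:
            self.next[p][ch] = cur
            p = self.link[p]
        if p == -1:
            self.link[cur] = 0
        else:
            q = self.next[p][ch]
            if self.length[p] + 1 == self.length[q]:
                self.link[cur] = q
            else:
                # Split q: the clone keeps only the shorter strings of q.
                clone = self._new_state(self.length[p] + 1,
                                        dict(self.next[q]), self.link[q])
                while p != -1 and self.next[p].get(ch) == q:
                    self.next[p][ch] = clone
                    p = self.link[p]
                self.link[q] = clone
                self.link[cur] = clone
        return cur

    def _walk(self, t):
        state = 0
        for ch in t:
            state = self.next[state].get(ch)
            if state is None:
                return None
        return state

    def contains(self, t: str) -> bool:
        """Every path from the initial state spells a substring."""
        return self._walk(t) is not None

    def is_suffix(self, t: str) -> bool:
        """A path ending in a terminal state spells a suffix."""
        state = self._walk(t)
        return state is not None and state in self.terminal

    def count_distinct_substrings(self) -> int:
        return sum(self.length[v] - self.length[self.link[v]]
                   for v in range(1, len(self.next)))
```

For the string "abab", the automaton reports 7 distinct substrings (a, b, ab, ba, aba, bab, abab). It accepts "ab" and "bab" as suffixes, and recognises "ba" as a substring but not as a suffix.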

Algorithm for string replacement based on conditional char replacement

Usage case: I'm writing a domain specific language (DSL) for a regex-like but way more powerful Lispy string processing system focused on conditional replacements (like simulation of language evolution for conlangers/linguists) rather than matching as regexes do. As usual I wrote down the specs before actually writing down the code.
However, due to a somewhat stupid but hard to fix mistake, I ended up with a system only capable of doing stuff one char at a time. Thus, a rewrite rule might be (in pseudocode) change 'a' to 'e' when last char is 's' and next char is 'd'. Chars can also be deleted: delete 'a' when ....
Since the interpreter for the DSL is a bit spaghetti-ish (not in the sense of unstructured, but in the sense that 1. I haven't figured out OO for my implementation lang Chicken Scheme 2. No IDE, so must remember 20+ variable names and use emacs) I don't want to touch it, but rather "unsugar" string replacements to conditional char replacements.
The trivial example: change "ab" to "cd" unconditionally rewrites to change 'a' to 'c' when followed by 'b'; change 'b' to 'd' when preceded by 'a'. However, when there are conditions, things become very ugly very quickly. Is there some easy recursive way to do the rewriting, or is this nearly impossible in the rewriting phase and I should probably fix my DSL interpreter? (Note: my DSL has ways to get the n-th letter before and after the current char)
The problem is that since we are going through the data character-at-a-time, when a condition is applied to a multi-character string, that condition has to be expressed in different ways for every position. For instance "abc" followed by "x" combines in a straightforward way into the condition for a, b and c, but has to change shape. The x is actually three positions away from a, but only two from b. This is bad because it causes a proliferation of conditions, which all get wastefully evaluated.
I'd solve this by adding the concept of frames into the interpreter. A frame is established at the current character position, and then holds that position somehow, allowing frame-relative addressing of the characters.
I can think of a few ways of introducing this position fixing. One would be to introduce variable binding into the interpreter. It could support a pair of instructions bind symbol and unbind n, where we would be using a gensym for the symbol.
When generating the code for an operation on a string like "abc", we would generate an instruction like bind #:g0025, which would fix the position of the a, and then the compiler will analyze the conditions applied to the string, and re-phrase them in terms that are relative to #:g0025. After the processing of "abc", we would emit unbind 1 to drop the most recently bound variable.
We could also bind variables to the Boolean values of conditions.
As an example with the named frames, suppose we have
Replace "abc" with "ijk" when preceded by "x" and followed by "yz".
This goes to something like:
bind #:frame
bind #:cond0 to when #:frame[-1] is "x" and #:frame[3] is "y" and #:frame[4] is "z"
replace "a" with "i" when #:cond0 ; these insns move the char position
replace "b" with "j" when #:cond0
replace "c" with "k" when #:cond0
unbind 2
So the difficulty has been translated to one of compiling the condition into frame-relative addressing. The #:frame[3] is derived from the length of the "abc" pattern, which is available to the translator all at once. That is information not available in the target language, which doesn't have "abc" all at once.
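The translation step can be sketched in Python roughly as follows. This is only an illustration of how the frame-relative offsets fall out of the pattern length: the desugar function and its parameters are hypothetical, it emits the pseudo-instructions as plain strings, and it assumes the pattern and its replacement have the same length.

```python
def desugar(pattern, replacement, before="", after=""):
    """Rewrite one string rule into frame-relative per-character rules.

    Offset 0 of the frame is the first character of the pattern, so the
    preceding context gets negative offsets and the following context
    starts at len(pattern).
    """
    if len(pattern) != len(replacement):
        raise ValueError("this sketch only handles equal-length replacements")

    conditions = [(i - len(before), ch) for i, ch in enumerate(before)]
    conditions += [(len(pattern) + i, ch) for i, ch in enumerate(after)]

    program = ["bind #:frame"]
    if conditions:
        text = " and ".join(f'#:frame[{off}] is "{ch}"' for off, ch in conditions)
        program.append(f"bind #:cond0 to when {text}")
    suffix = " when #:cond0" if conditions else ""
    for old, new in zip(pattern, replacement):
        program.append(f'replace "{old}" with "{new}"{suffix}')
    # Drop the frame, plus the condition variable if one was bound.
    program.append(f"unbind {2 if conditions else 1}")
    return program


for line in desugar("abc", "ijk", before="x", after="yz"):
    print(line)
```

For the rule above this prints the same instruction sequence as the hand-written example, with #:frame[3] and #:frame[4] computed from the length of "abc".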
The system almost certainly needs some way to try different matches at the same location. If there is no "abc" at the current position, another rule which replaces "foo" with something has to be tried at the same position. Perhaps when the conditions fail, the instruction doesn't advance the character position. So in our example above, that would work: all instructions share the same condition, so in the case of a match the position moves by three positions, otherwise it doesn't. Still, in spite of that, there may be a requirement to have multiple edits with different conditions at the same spot. The scope of my answer isn't to design the whole thing, though.

Drawing a minimal DFA for a given regular expression

What is a direct and easy approach to drawing a minimal DFA that accepts the same language as a given regular expression (RE)?
I know it can be done by:
Regex ---to----► NFA ---to-----► DFA ---to-----► minimized DFA
But is there a shortcut, for example for (a+b)*ab?
Regular Expression to DFA
Although there is no algorithmic shortcut for drawing a DFA from a regular expression (RE), a shortcut technique is possible by analysis rather than by derivation, and it can save you time when drawing a minimized DFA. Of course, you can learn the technique only by practice. I'll take your example to show my approach:
(a + b)*ab
First, think about the language of the regular expression. If it's difficult to state the language description at the first attempt, find the smallest possible strings that can be generated in the language, then the second smallest, and so on.
Keep memorized solutions of some basic regular expressions. For example, I have written here some basic ideas for writing left-linear and right-linear grammars directly from a regular expression. Similarly, you can do this for constructing a minimized DFA.
In the RE (a + b)*ab, the smallest possible string is ab, because (a + b)* can generate the NULL (^) string. The second smallest string can be either aab or bab. One thing we can easily notice about the language is that every string in the language of this RE ends with ab (suffix), whereas the prefix can be any string consisting of a and b, including ^.
Also, if the current symbol is a, one possibility is that the next symbol is b and the string ends. Thus the DFA requires a transition such that whenever a b comes after an a, it moves to a final state.
Next, if a new symbol arrives in the final state, we should move to some non-final state, because a symbol after b is possible only in the middle of a string in the language, as every string in the language terminates with the suffix 'ab'.
So with this knowledge at this stage we can draw an incomplete transition diagram like below:
--►(Q0)---a---►(Q1)---b----►((Qf))
Now at this point you need to understand: every state has some meaning for example
(Q0) means = Start state
(Q1) means = Last symbol was 'a', and with one more 'b' we can shift to a final state
(Qf) means = Last two symbols were 'ab'
Now think about what happens if a symbol a arrives in the final state. Just move to state Q1, because this state means the last symbol was a. (updated transition diagram)
--►(Q0)---a---►(Q1)---b----►((Qf))
                ▲-----a--------|
But suppose that instead of a symbol a, a symbol b arrives in the final state. Then we should move from the final state to some non-final state. In the present transition graph, in this situation we should move from the final state Qf to the initial state (as we again need ab in the string for acceptance).
--►(Q0)---a---►(Q1)---b----►((Qf))
    ▲           ▲-----a--------|
    |----------------b--------|
This graph is still incomplete, because there is no outgoing edge for symbol a from Q1. For symbol a in state Q1 a self-loop is required, because Q1 means the last symbol was an a.
                a-
                ||
                ▼|
--►(Q0)---a---►(Q1)---b----►((Qf))
    ▲           ▲-----a--------|
    |----------------b--------|
Now I believe all possible outgoing edges are present from Q1 and Qf in the above graph. One missing edge is the outgoing edge from Q0 for symbol b. There must be a self-loop at state Q0, because again we need the sequence ab for the string to be accepted (the shift from Q0 to Qf is possible with ab).
    b-          a-
    ||          ||
    ▼|          ▼|
--►(Q0)---a---►(Q1)---b----►((Qf))
    ▲           ▲-----a--------|
    |----------------b--------|
Now the DFA is complete!
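The finished diagram can be written down as a transition table and checked mechanically. The following Python sketch is my own; the function names are arbitrary. It encodes the three states above and compares the DFA against the "ends with ab" description for all short strings over {a, b}.

```python
from itertools import product

# Transition table read off the final diagram.
DFA = {
    "Q0": {"a": "Q1", "b": "Q0"},   # start state
    "Q1": {"a": "Q1", "b": "Qf"},   # last symbol was 'a'
    "Qf": {"a": "Q1", "b": "Q0"},   # last two symbols were 'ab'
}
START, FINAL = "Q0", {"Qf"}


def accepts(word: str) -> bool:
    state = START
    for symbol in word:
        if symbol not in DFA[state]:
            return False            # symbol outside the alphabet {a, b}
        state = DFA[state][symbol]
    return state in FINAL


def agrees_with_suffix_rule(max_len: int = 8) -> bool:
    """Compare the DFA with 'word ends with ab' for all words up to max_len."""
    for n in range(max_len + 1):
        for letters in product("ab", repeat=n):
            word = "".join(letters)
            if accepts(word) != word.endswith("ab"):
                return False
    return True
```

Here accepts("ab") and accepts("bab") are true, accepts("aba") is false, and agrees_with_suffix_rule() confirms the table matches (a + b)*ab on every string up to the chosen length.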
Of course, the method might look difficult on the first few tries. But if you learn to draw this way, you will observe an improvement in your analytical skills, and you will find this method a quick and objective way to draw a DFA.
* In the link I have given, I describe some more regular expressions. I would highly encourage you to learn them and try to make DFAs for those regular expressions too.
