How to determine which language is simpler - regular-language

The following language is the complement of a simpler language.
Construct a DFA for the simpler language and then use it to give the state diagram of a DFA for the given language where Σ = {a,b}.
L={ w : w does not contain the substring baba}.
I don't know which is the simpler language, can anybody please explain?

The automata accepting any word containing the substring 'baba' is:
(it's the simpler language. A regexp for it is : (a|b)*baba(a|b)* )
And we can build the complement DFA by turning accepting states into non-accepting and vice-versa as you mentionned it:
(it has to completed)

It's been a while since my Logic & Computability class, but my guess is the complement language Lc would be = { w: w contains the substring 'baba'}.
It's probably pretty easy to make a DFA that accepts substrings of 'baba', you'd probably just have states firstB, firstA, secondB, and secondA and so on.
Making a complement DFA then is trivial, just make the accepting states non-accepting and vice versa.

Related

Is Rust's lexical grammar regular, context-free or context-sensitive?

The lexical grammar of most programming languages is fairly non-expressive in order to quickly lex it. I'm not sure what category Rust's lexical grammar belongs to. Most of it seems regular, probably with the exception of raw string literals:
let s = r##"Hi lovely "\" and "#", welcome to Rust"##;
println!("{}", s);
Which prints:
Hi lovely "\" and "#", welcome to Rust
As we can add arbitrarily many #, it seems like it can't be regular, right? But is the grammar at least context-free? Or is there something non-context free about Rust's lexical grammar?
Related: Is Rust's syntactical grammar context-free or context-sensitive?
The raw string literal syntax is not context-free.
If you think of it as a string surrounded by r#k"…"#k (using the superscript k as a count operator), then you might expect it to be context-free:
raw_string_literal
: 'r' delimited_quoted_string
delimited_quoted_string
: quoted_string
| '#' delimited_quoted_string '#'
But that is not actually the correct syntax, because the quoted_string is not allowed to contain "#k although it can contain "#j for any j<k
Excluding the terminating sequence without excluding any other similar sequence of a different length cannot be accomplished with a context-free grammar because it involves three (or more) uses of the k-repetition in a single production, and stack automata can only handle two. (The proof that the grammar is not context-free is surprisingly complicated, so I'm not going to attempt it here for lack of MathJax. The best proof I could come up with uses Ogden's lemma and the uncommonly cited (but highly useful) property that context-free grammars are closed under the application of a finite-state transducer.)
C++ raw string literals are also context-sensitive [or would be if the delimiter length were not limited, see Note 1], and pretty well all whitespace-sensitive languages (like Python and Haskell) are context-sensitive. None of these lexical analysis tasks is particularly complicated so the context-sensitivity is not a huge problem, although most standard scanner generators don't provide as much assistance as one might like. But there it is.
Rust's lexical grammar offers a couple of other complications for a scanner generator. One issue is the double meaning of ', which is used both to create character literals and to mark lifetime variables and loop labels. Apparently it is possible to determine which of these applies by considering the previously recognized token. That could be solved with a lexical scanner which is capable of generating two consecutive tokens from a single pattern, or it could be accomplished with a scannerless parser; the latter solution would be context-free but not regular. (C++'s use of ' as part of numeric literals does not cause the same problem; the C++ tokens can be recognized with regular expressions, because the ' can not be used as the first character of a numeric literal.)
Another slightly context-dependent lexical issue is that the range operator, .., takes precedence over floating point values, so that 2..3 must be lexically analysed as three tokens: 2 .. 3, rather than two floating point numbers 2. .3, which is how it would be analysed in most languages which use the maximal munch rule. Again, this might or might not be considered a deviation from regular expression tokenisation, since it depends on trailing context. But since the lookahead is at most one character, it could certainly be implemented with a DFA.
Postscript
On reflection, I am not sure that it is meaningful to ask about a "lexical grammar". Or, at least, it is ambiguous: the "lexical grammar" might refer to the combined grammar for all of the languages "tokens", or it might refer to the act of separating a sentence into tokens. The latter is really a transducer, not a parser, and suggests the question of whether the language can be tokenised with a finite-state transducer. (The answer, again, is no, because raw strings cannot be recognized by a FSA, or even a PDA.)
Recognizing individual tokens and tokenising an input stream are not necessarily equivalent. It is possible to imagine a language in which the individual tokens are all recognized by regular expressions but an input stream cannot be handled with a finite-state transducer. That will happen if there are two regular expressions T and U such that some string matching T is the longest token which is a strict prefix of an infinite set of strings in U. As a simple (and meaningless) example, take a language with tokens:
a
a*b
Both of these tokens are clearly regular, but the input stream cannot be tokenized with a finite state transducer because it must examine any sequence of as (of any length) before deciding whether to fallback to the first a or to accept the token consisting of all the as and the following b (if present).
Few languages show this pathology (and, as far as I know, Rust is not one of them), but it is technically present in some languages in which keywords are multiword phrases.
Notes
Actually, C++ raw string literals are, in a technical sense, regular (and therefore context free) because their delimiters are limited to strings of maximum length 16 drawn from an alphabet of 88 characters. That means that it is (theoretically) possible to create a regular expression consisting of 13,082,362,351,752,551,144,309,757,252,761 patterns, each matching a different possible raw string delimiter.

How to prove the correctness of a given grammar?

I am wondering how programming langauge developers validate and prove that their grammar is correct. Suppose that I created a new grammar for a new langauge. I can test my grammar with a unit test tool by providing different kinds of test programs. However, I will never 100% ensure that my grammar is correct. How do language developers ensure that their grammar is correct in real world?
Let's say I created a grammar for a new language using pencil and paper. However, I did a mistake and my grammar accepts the expressions that end with a + like 2+2+. I will implement my language using this incorrect grammar, if I don't find the mistake in it. After implementation and unit testing, I can find the error. Is it possible to find it before starting any implementation?
Definitely, I can try my grammar with some sample inputs using pencil and paper (derivation etc.), but I may miss some corner cases. Is there a better approach or how in the real language developers test their grammar?
A proof is a logical argument that demonstrates the truth of a claim. There are as many ways to prove something as there are ways of thinking about a problem. A common way to prove things about discrete structures (like grammars) is using mathematical induction. Basically, you show that something is true in base cases - the simplest cases possible - and then show that if it's true for all cases under a certain size, it must therefore be true for cases of the next size.
In our case: suppose we wanted only to prove your grammar didn't generate + at the end of a word. We could do induction on the number of productions used in constructing a string in the language. We would identify all relevant base cases, show the property holds for these strings, and then show that longer strings in the language are constructed in such a way that it is impossible to get a + at the end. Here's an example.
S := S + S | (S) | x
Base case: the shortest string in the language is x, generated as S -> x. It does not end with a +.
Induction hypothesis: assume all strings produced using up to and including k productions do not end with +.
Induction step: we must show strings produced using more than k productions do not end with +. If we apply the rule (S) to any string generated from S, we do not add + so the property holds. If we apply S + S to strings generated from S, the last symbol in S + S is the last symbol of a shorter string (at least 2 symbols shorter) generated by S. By the induction hypothesis, that string did not end in +, so neither does this one. There are no other productions, so no string in the language ends in +. QED

Start state for a DFA in Automata

Suppose a DFA has to be designed which accept all string over Σ={0,1}* which start and ends with same symbol(e.g-0110,10101 etc.).Is ε a acceptable string ? Which means,Is start state a final state?
It depends entirely on what is meant. Human languages are vague and imprecise; that's why we invent formalisms like regular expressions in the first place.
If this is an exercise, I would ask whomever is giving you the exercise for clarification. On the surface, two interpretations seem reasonable:
The empty string does not start and end with different letters, so it should not be excluded
The empty string does not start and end with the same letter, so it should not be included
If it is an exercise and you have the original wording, you can provide a quote, but as stated, the answer is simply not clear. If homework, you could always provide two DFAs, one for each interpretation, with some discussion of the ambiguity.
If it is just a question you made up, then you will have to answer for yourself whether you want the empty string in your language.
YES.
The String ε belongs to {0,1}* and its start and end symbols are not different.So it should be accepted by the DFA

drawing minmal DFA for the given regular expression

What is the direct and easy approach to draw minimal DFA, that accepts the same language as of given Regular Expression(RE).
I know it can be done by:
Regex ---to----► NFA ---to-----► DFA ---to-----► minimized DFA
But is there any shortcut way? like for (a+b)*ab
Regular Expression to DFA
Although there is NO algorithmic shortcut to draw DFA from a Regular Expression(RE) but a shortcut technique is possible by analysis not by derivation, it can save your time to draw a minimized dfa. But off-course the technique you can learn only by practice. I take your example to show my approach:
(a + b)*ab
First, think about the language of the regular expression. If its difficult to sate what is the language description at first attempt, then find what is the smallest possible strings can be generate in language then find second smallest.....
Keep memorized solution of some basic regular expressions. For example, I have written here some basic idea to writing left-linear and right-linear grammars directly from regular expression. Similarly you can write for construing minimized dfa.
In RE (a + b)*ab, the smallest possible string is ab because using (a + b)* one can generate NULL(^) string. Second smallest string can be either aab or bab. Now one thing we can easily notice about language is that any string in language of this RE always ends with ab (suffix), Whereas prefix can be any possible string consist of a and b including ^.
Also, if current symbol is a; then one possible chance is that next symbol would be a b and string end. Thus in dfa we required, a transition such that when ever a b symbol comes after symbol a, then it should be move to some of the final state in dfa.
Next, if a new symbol comes on final state then we should move to some non-final state because any symbol after b is possible only in middle of some string in language as all language string terminates with suffix 'ab'.
So with this knowledge at this stage we can draw an incomplete transition diagram like below:
--►(Q0)---a---►(Q1)---b----►((Qf))
Now at this point you need to understand: every state has some meaning for example
(Q0) means = Start state
(Q1) means = Last symbol was 'a', and with one more 'b' we can shift to a final state
(Qf) means = Last two symbols was 'ab'
Now think what happens if a symbol a comes on final state. Just more to state Q1 because this state means last symbol was a. (updated transition diagram)
--►(Q0)---a---►(Q1)---b----►((Qf))
▲-----a--------|
But suppose instead of symbol a a symbol b comes at final state. Then we should move from final state to some non-final state. In present transition graph in this situation we should make a move to initial state from final state Qf.(as again we need ab in string for acceptation)
--►(Q0)---a---►(Q1)---b----►((Qf))
▲ ▲-----a--------|
|----------------b--------|
This graph is still incomplete! because there is no outgoing edge for symbol a from Q1. And for symbol a on state Q1 a self loop is required because Q1 means last symbol was an a.
a-
||
▼|
--►(Q0)---a---►(Q1)---b----►((Qf))
▲ ▲-----a--------|
|----------------b--------|
Now I believe all possible out-going edges are present from Q1 & Qf in above graph. One missing edge is an out-going edge from Q0 for symbol b. And there must be a self loop at state Q0 because again we need a sequence of ab so that string can be accept. (from Q0 to Qf shift is possible with ab)
b- a-
|| ||
▼| ▼|
--►(Q0)---a---►(Q1)---b----►((Qf))
▲ ▲-----a--------|
|----------------b--------|
Now DFA is complete!
Off-course the method might look difficult at first few tries. But if you learn to draw this way you will observe improvement in your analytically skills. And you will find this method is quick and objective way to draw DFA.
* In the link I given, I described some more regular expressions, I would highly encourage you to learn them and try to make DFA for those regular expressions too.

Algorithm for Negating Sentences

I was wondering if anyone was familiar with any attempts at algorithmic sentence negation.
For example, given a sentence like "This book is good" provide any number of alternative sentences meaning the opposite like "This book is not good" or even "This book is bad".
Obviously, accomplishing this with a high degree of accuracy would probably be beyond the scope of current NLP, but I'm sure there has been some work on the subject. If anybody knows of any work, care to point me to some papers?
While I'm not aware of any work that specifically looks at automatically generating negated sentences, I imagine a good place to start would be to read up on linguistics work in formal semantics and pragmatics. A good accessible introduction would be Steven C. Levinson's Pragmatics book.
One issue that I think you'll run into is that it can be very difficult to negate all the information that is conveyed by a sentence. For example, take:
John fixed the vase that he broke.
Even if you change this to John did not fix the vase that he broke, there is a presupposition that there is a vase and that John broke it.
Similarly, simply negating the sentence John did not stopped using drugs as John stopped using drugs still conveys that John, at one point, used drugs. A more thorough negation would be John never used drugs.
Some existing natural language processing (NLP) work that you might want to look at is MacCartney and Manning 2007's Natural Logic for Textual Inference. In this paper they use George Lakoff's notion of Natural Logic and Sanchez Valencia's monotonicity calculus to create software that automatically determines whether one sentence entails another. You could probably use some their techniques for detecting non-entailment to artificially construct negated and contradicting sentences.
I'd recommend checking out wordnet. You can use it to lookup antonyms for a word, so you could conceivably replace "bad" with "not good" since bad is an antonym of good. NLTK has a simple python interface to wordnet.
The naïve way of course, is to try to add "not" right after {am,are,is}. I have no idea how this will work in your setting though, it will probably only work with predicate-like sentences.
For simple sentences parse looking for adverbs or adjectives given the English grammar rules and substitute an antonym if only one meaning exists. Otherwise use the correct English negation rule to negate the verb (ie: is -> is not).
High level algorithm:
Look up each word for it's type (noun, verb, adjective, adverb, conjunction, etc...)
Infer sentence structure from word type sequences (Your sentence was: article, noun, verb, adjective/adverb; This is known to be a simple sentence.)
For simple sentences, choose one invertible word and invert it. Either by using an antonym, or negating the verb.
For more complex sentences, such as those with subordinate clauses, you will need to have more complex analysis, but for simple sentences, this shouldn't be infeasible.
There's a similar process for first-order logic. The usual algorithm is to map P to not P, and then perform valid translations to move the not somewhere convenient, e.g.:
Original: (not R(x) => exists(y) (O(y) and P(x, y)))
Negate it: not (not R(x) => exists(y) (O(y) and P(x, y)))
Rearrange: not (R(x) or exists(y) (O(y) and P(x, y)))
not R(x) and not exists(y) (O(y) and P(x, y))
not R(x) and forall(y) not (O(y) and P(x, y))
not R(x) and forall(y) (not O(y) or not P(x, y))
Performing the same on English you'd be negating "If it's not raining here, then there is some activity that is an outdoors activity and can be performed here" to "It is NOT the case that ..." and finally into "It's not raining and every possible activity is either not for outdoors or can't be performed here."
Natural language is a lot more complicated than first-order logic, of course... but if you can parse the sentence into something where the words "not", "and", "or", "exists" etc. can be identified, then you should be able to perform similar translations.
For a rule-based negation approach, you can take a look at the Python module negate1.
1 Disclaimer: I am the author of the module.
As for some papers related to the topic, you can take a look at:
Understanding by Understanding Not: Modeling Negation in Language Models
An Analysis of Natural Language Inference Benchmarks through the Lens of Negation
Not another Negation Benchmark: The NaN-NLI Test Suite for Sub-clausal Negation
Nice demos using NTLK - http://text-processing.com/demo and a short writeup - http://text-processing.com/demo/sentiment/.

Resources