Generalizing the pumping lemma for UNIX-style regular expressions

Most UNIX regular expression dialects have, besides the usual *, +, ? operators, a backreference operator: \1, \2, ... match whatever the corresponding parenthesized group matched, so for example (a*)b\1 matches the (non-regular) language a^n b a^n.
On one hand, this seems pretty powerful, since you can write (a*)b\1b\1 to match the language a^n b a^n b a^n, which can't even be recognized by a pushdown automaton. On the other hand, I'm pretty sure a^n b^n cannot be expressed this way.
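For concreteness, here is the backreference in action (shown in Ruby as an illustration; any backreference-capable engine behaves the same way):
re = /\A(a*)b\1\z/
re.match("aaabaaa") # => #<MatchData "aaabaaa" 1:"aaa">
re.match("aaabaa")  # => nil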
I have two questions:
Is there any literature on this family of languages (UNIX-style regular)? In particular, is there a version of the pumping lemma for them?
Can someone prove, or disprove, that *a^n b^n* cannot be expressed this way?

You're probably looking for
Benjamin Carle and Paliath Narendran, "On Extended Regular Expressions", LNCS 5457. DOI: 10.1007/978-3-642-00982-2_24. Extended abstract (PDF): http://hal.archives-ouvertes.fr/docs/00/17/60/43/PDF/notes_on_extended_regexp.pdf
C. Campeanu, K. Salomaa, S. Yu, "A Formal Study of Practical Regular Expressions", International Journal of Foundations of Computer Science, Vol. 14 (2003), 1007-1018. DOI: 10.1142/S012905410300214X
and of course follow their citations forward and backward to find more literature on this subject.

a^n b^n is a context-free language. A grammar for it is
A -> aAb | ε
You can use the pumping lemma for regular languages to prove it is not regular: with pumping length p, take w = a^p b^p; any decomposition w = xyz with |xy| <= p puts y inside the a's, so pumping y changes the number of a's but not the number of b's.

Ruby 1.9.1 supports the following regex:
regex = %r{ (?<foo> a\g<foo>a | b\g<foo>b | c) }x
p regex.match("aaacbbb")
# the result is #<MatchData "c" foo:"c">
"Fun with Ruby 1.9 Regular Expressions" has an example where he actually arranges all the parts of a regex so that it looks like a context-free grammar as follows:
sentence = %r{
(?<subject> cat | dog | gerbil ){0}
(?<verb> eats | drinks| generates ){0}
(?<object> water | bones | PDFs ){0}
(?<adjective> big | small | smelly ){0}
(?<opt_adj> (\g<adjective>\s)? ){0}
The\s\g<opt_adj>\g<subject>\s\g<verb>\s\g<opt_adj>\g<object>
}x
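A quick check that the grammar-style regex above works as advertised (same engine; the test strings are my own):
sentence.match("The big cat eats small PDFs")  # matches
sentence.match("The cat eats")                 # => nil (no object)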
I think this means that at least Ruby 1.9.1's regex engine, which is the Oniguruma regex engine, is actually equivalent in power to context-free grammars, though the capturing groups aren't as useful as those of an actual parser generator.
This means that "Pumping lemma for context-free languages" should describe the class of languages recognizable by Ruby 1.9.1's regex engine.
EDIT: Whoops! I messed up, and didn't do an important test which actually makes my answer above totally wrong. I won't delete the answer, because it's useful information nonetheless.
regex = %r{\A(?<foo> a\g<foo>a | b\g<foo>b | c)\Z}x
#I added anchors for the beginning and end of the string
regex.match("aaacbbb")
#returns nil, indicating that no match is possible with recursive capturing groups.
EDIT: Coming back to this many months later, I just discovered that my test in the last edit was incorrect. "aaacbbb" shouldn't be expected to match, even if regex does operate like a context-free grammar: the pattern generates exactly the strings of the form w c reverse(w) for w over {a,b}, and "aaacbbb" is not of that form. The correct test is a string like "aabcbaa", and that does match the regex:
regex = %r{\A(?<foo> a\g<foo>a | b\g<foo>b | c)\Z}x
regex.match("aaacaaa")
# => #<MatchData "aaacaaa" foo:"aaacaaa">
regex.match("aacaa")
# => #<MatchData "aacaa" foo:"aacaa">
regex.match("aabcbaa")
# => #<MatchData "aabcbaa" foo:"aabcbaa">
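In other words, the pattern recognizes exactly the strings w c reverse(w) with w over {a,b}, a context-free but non-regular language. A couple more data points with the same regex:
regex.match("abcba")  # => #<MatchData "abcba" foo:"abcba">
regex.match("abcab")  # => nil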

Related

Repeating Pattern Matching in antlr4

I'm trying to write a lexer rule that would match the following strings:
a
aa
aaa
bbbb
The requirement here is that all characters must be the same.
I tried to use this rule:
REPEAT_CHARS: ([a-z])(\1)*
But \1 is not valid in ANTLR4. Is it possible to come up with a pattern for this?
You can't do that in an ANTLR lexer. At least, not without target-specific code inside your grammar, and placing code in your grammar is something you should avoid (it makes the grammar hard to read and ties it to one target language). It is better to do those kinds of checks/validations inside a listener or visitor.
Things like backreferences and look-arounds are features that crept into the regex engines of programming languages. The regular expression syntax available in ANTLR (and every parser generator I know of) does not support those features; it describes true regular languages.
Many features found in virtually all modern regular expression libraries provide an expressive power that far exceeds the regular languages. For example, many implementations allow grouping subexpressions with parentheses and recalling the value they match in the same expression (backreferences). This means that, among other things, a pattern can match strings of repeated words like "papa" or "WikiWiki", called squares in formal language theory.
-- https://en.wikipedia.org/wiki/Regular_expression#Patterns_for_non-regular_languages
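For what it's worth, in an engine that does have backreferences (Ruby shown here as an illustration), the question's same-character requirement is a one-liner, which underlines that this power goes beyond true regular languages:
re = /\A([a-z])\1*\z/
re.match("aaa")  # => #<MatchData "aaa" 1:"a">
re.match("aab")  # => nil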

Is Rust's lexical grammar regular, context-free or context-sensitive?

The lexical grammar of most programming languages is fairly non-expressive so that it can be lexed quickly. I'm not sure which category Rust's lexical grammar belongs to. Most of it seems regular, probably with the exception of raw string literals:
let s = r##"Hi lovely "\" and "#", welcome to Rust"##;
println!("{}", s);
Which prints:
Hi lovely "\" and "#", welcome to Rust
As we can add arbitrarily many #, it seems like it can't be regular, right? But is the grammar at least context-free? Or is there something non-context-free about Rust's lexical grammar?
Related: Is Rust's syntactical grammar context-free or context-sensitive?
The raw string literal syntax is not context-free.
If you think of it as a string surrounded by r#^k"…"#^k (using ^k as a count operator), then you might expect it to be context-free:
raw_string_literal
: 'r' delimited_quoted_string
delimited_quoted_string
: quoted_string
| '#' delimited_quoted_string '#'
But that is not actually the correct syntax, because the quoted_string is not allowed to contain "#^k, although it can contain "#^j for any j < k.
Excluding the terminating sequence without excluding any other similar sequence of a different length cannot be accomplished with a context-free grammar because it involves three (or more) uses of the k-repetition in a single production, and stack automata can only handle two. (The proof that the grammar is not context-free is surprisingly complicated, so I'm not going to attempt it here for lack of MathJax. The best proof I could come up with uses Ogden's lemma and the uncommonly cited (but highly useful) property that context-free grammars are closed under the application of a finite-state transducer.)
C++ raw string literals are also context-sensitive [or would be if the delimiter length were not limited, see Note 1], and pretty well all whitespace-sensitive languages (like Python and Haskell) are context-sensitive. None of these lexical analysis tasks is particularly complicated so the context-sensitivity is not a huge problem, although most standard scanner generators don't provide as much assistance as one might like. But there it is.
Rust's lexical grammar offers a couple of other complications for a scanner generator. One issue is the double meaning of ', which is used both to create character literals and to mark lifetime variables and loop labels. Apparently it is possible to determine which of these applies by considering the previously recognized token. That could be solved with a lexical scanner which is capable of generating two consecutive tokens from a single pattern, or it could be accomplished with a scannerless parser; the latter solution would be context-free but not regular. (C++'s use of ' as part of numeric literals does not cause the same problem; the C++ tokens can be recognized with regular expressions, because the ' cannot be used as the first character of a numeric literal.)
Another slightly context-dependent lexical issue is that the range operator, .., takes precedence over floating point values, so that 2..3 must be lexically analysed as three tokens: 2 .. 3, rather than two floating point numbers 2. .3, which is how it would be analysed in most languages which use the maximal munch rule. Again, this might or might not be considered a deviation from regular expression tokenisation, since it depends on trailing context. But since the lookahead is at most one character, it could certainly be implemented with a DFA.
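To illustrate that one character of lookahead suffices for the 2..3 case, here is a minimal tokenizer sketch (Ruby's StringScanner; the token set is hypothetical and far smaller than Rust's real lexer):
require 'strscan'

def tokens(src)
  s = StringScanner.new(src)
  out = []
  until s.eos?
    if s.scan(/\d+\.(?!\.)\d*/)  # float: the '.' must not be followed by another '.'
      out << [:FLOAT, s.matched]
    elsif s.scan(/\d+/)
      out << [:INT, s.matched]
    elsif s.scan(/\.\./)
      out << [:DOTDOT, s.matched]
    elsif s.scan(/\s+/)
      # skip whitespace
    else
      out << [:UNKNOWN, s.getch]
    end
  end
  out
end

p tokens("2..3")  # => [[:INT, "2"], [:DOTDOT, ".."], [:INT, "3"]]
p tokens("2.5")   # => [[:FLOAT, "2.5"]]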
Postscript
On reflection, I am not sure that it is meaningful to ask about a "lexical grammar". Or, at least, it is ambiguous: the "lexical grammar" might refer to the combined grammar for all of the language's tokens, or it might refer to the act of separating a sentence into tokens. The latter is really a transducer, not a parser, and raises the question of whether the language can be tokenised with a finite-state transducer. (The answer, again, is no, because raw strings cannot be recognized by a FSA, or even a PDA.)
Recognizing individual tokens and tokenising an input stream are not necessarily equivalent. It is possible to imagine a language in which the individual tokens are all recognized by regular expressions but an input stream cannot be handled with a finite-state transducer. That will happen if there are two regular expressions T and U such that some string matching T is the longest token which is a strict prefix of an infinite set of strings in U. As a simple (and meaningless) example, take a language with tokens:
a
a*b
Both of these tokens are clearly regular, but the input stream cannot be tokenized with a finite-state transducer because it must examine any sequence of a's (of any length) before deciding whether to fall back to the first a or to accept the token consisting of all the a's and the following b (if present).
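A backtracking sketch makes the problem concrete (Ruby; the regex engine silently re-scans the run of a's, which is exactly what a one-pass finite-state transducer cannot do):
def munch(src)
  out = []
  until src.empty?
    if src =~ /\A(a*b)/   # maximal munch: prefer a*b when a b eventually follows
      out << [:AB, $1]
      src = $'
    elsif src =~ /\Aa/
      out << [:A, "a"]
      src = src[1..-1]
    else
      raise "stuck at #{src.inspect}"
    end
  end
  out
end

p munch("aaa")   # => [[:A, "a"], [:A, "a"], [:A, "a"]]
p munch("aaab")  # => [[:AB, "aaab"]]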
Few languages show this pathology (and, as far as I know, Rust is not one of them), but it is technically present in some languages in which keywords are multiword phrases.
Notes
1. Actually, C++ raw string literals are, in a technical sense, regular (and therefore context-free) because their delimiters are limited to strings of maximum length 16 drawn from an alphabet of 88 characters. That means that it is (theoretically) possible to create a regular expression consisting of 13,082,362,351,752,551,144,309,757,252,761 patterns, each matching a different possible raw string delimiter.
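That count is just the number of delimiters of length 0 through 16 over an 88-character alphabet, which a Ruby one-liner can sanity-check:
(0..16).sum { |k| 88 ** k }
# => 13082362351752551144309757252761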

How to prove the correctness of a given grammar?

I am wondering how programming language developers validate and prove that their grammars are correct. Suppose I created a new grammar for a new language. I can test my grammar with a unit-test tool by providing different kinds of test programs, but I will never be 100% sure that my grammar is correct. How do language developers ensure their grammars are correct in the real world?
Let's say I created a grammar for a new language using pencil and paper, but I made a mistake, and my grammar accepts expressions that end with a +, like 2+2+. If I don't find the mistake, I will implement my language using this incorrect grammar. After implementation and unit testing I can find the error, but is it possible to find it before starting any implementation?
Certainly I can try my grammar on some sample inputs using pencil and paper (derivations, etc.), but I may miss some corner cases. Is there a better approach, and how do real-world language developers test their grammars?
A proof is a logical argument that demonstrates the truth of a claim. There are as many ways to prove something as there are ways of thinking about a problem. A common way to prove things about discrete structures (like grammars) is using mathematical induction. Basically, you show that something is true in base cases - the simplest cases possible - and then show that if it's true for all cases under a certain size, it must therefore be true for cases of the next size.
In our case: suppose we wanted only to prove your grammar didn't generate + at the end of a word. We could do induction on the number of productions used in constructing a string in the language. We would identify all relevant base cases, show the property holds for these strings, and then show that longer strings in the language are constructed in such a way that it is impossible to get a + at the end. Here's an example.
S := S + S | (S) | x
Base case: the shortest string in the language is x, generated as S -> x. It does not end with a +.
Induction hypothesis: assume all strings produced using up to and including k productions do not end with +.
Induction step: we must show strings produced using more than k productions do not end with +. If we apply the rule (S) to any string generated from S, we do not add + so the property holds. If we apply S + S to strings generated from S, the last symbol in S + S is the last symbol of a shorter string (at least 2 symbols shorter) generated by S. By the induction hypothesis, that string did not end in +, so neither does this one. There are no other productions, so no string in the language ends in +. QED
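The inductive proof is the real answer, but the unit-testing idea from the question can complement it. Here is a throwaway random sampler for the toy grammar S -> S + S | (S) | x (my own sketch, not a standard tool); note that sampling can only falsify, never prove, which is why the induction matters:
def gen(depth = 0)
  return "x" if depth > 4  # cap recursion so generation terminates
  case rand(3)
  when 0 then "#{gen(depth + 1)}+#{gen(depth + 1)}"  # S -> S + S
  when 1 then "(#{gen(depth + 1)})"                  # S -> (S)
  else        "x"                                    # S -> x
  end
end

1000.times do
  w = gen
  raise "counterexample: #{w}" if w.end_with?("+")
end
puts "no sampled string ends with '+'"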

Regular expression with quantity dependence

Is it possible to create regular expression that will describe words like:
aabc
aaaabcbc
aaaaaabcbcbc
?
The words are built so that each occurrence of bc on the right corresponds to an occurrence of aa on the left (i.e., the language is (aa)^n (bc)^n with n >= 1).
Words below are not valid:
aa
bc
aaabc
aaaabc
aabcbc
No, it cannot be expressed as a regular expression, because your language requires a number of "aa" blocks followed by an equal number of "bc" blocks. Counting these requires unbounded memory, and finite automata have only finite memory.
It can be expressed as a context-free grammar:
S -> aaSbc | ε
where ε stands for the string of zero length (the empty string). This generates ε, aabc, aaaabcbc, and so on. (If, as the examples suggest, the empty string should be excluded, use S -> aaSbc | aabc instead.)
Just FYI, it is possible with non-regular features. For example, you can play with balancing groups in .NET and get something like this:
^(?<a>aa)+(bc(?<-a>))+(?(a)(?!))$
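And, tying back to the main question's thread: Ruby's recursive subexpression calls (a different extension from both backreferences and .NET balancing groups) can match the same language, mirroring the grammar S -> aaSbc | aabc:
re = %r{\A(?<x> aa\g<x>bc | aabc )\z}x
re.match("aabc")      # matches
re.match("aaaabcbc")  # matches
re.match("aaabc")     # => nil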

How can I see directly that a language is not regular?

Given L = {a^n b^n c^n}, how can I say directly, without looking at production rules, that this language is not regular? I can use the pumping lemma, but some people say they can tell that it is not regular just by looking at the grammar. How is that possible?
You have three characters in your alphabet, and all of them depend on the same variable: n.
Now, if you had only two of them, say {a^n b^n}, you could easily accomplish the task with this production:
S -> ab | aSb
But you have three of them, and there is no way to link all of them to the same variable. You would need two nonterminals, but then they are unlinked, and each can generate a different number of symbols. The only way to link all three is with a single nonterminal, and that is impossible.
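(Side note, since {a^n b^n} came up: engines with recursive subexpression calls, such as Ruby's, can mirror S -> ab | aSb directly; this is recursion, not a plain regular expression:)
re = %r{\A(?<x> a\g<x>b | ab )\z}x
re.match("aaabbb")  # matches
re.match("aaabb")   # => nil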
You can't do:
S -> abc | aSbc
In fact, you can't have a nonterminal in your final string, so that is not a string yet; it needs to be rewritten further. And what can you do from that point?
You can do:
aabcbc
or you can do:
aaSbcbc
The first is a terminal string, but it isn't part of your language (the b's and c's are interleaved). The second is not a string yet, and it is easy to see that you can't derive any allowed string from it.
