Operating on a Regular Expression before applying Pumping Lemma - regular-language

When applying the pumping lemma to prove that a language is irregular, is it allowed to start by operating on the language to make applying the pumping lemma easier?
For example, when attempting to disprove:
L = {0^i 1^j | j <= i and i, j > 0}
Can I first reverse the language like so:
L = {1^j 0^i | j <= i and i, j > 0}
and reason that "if the language were regular, reversing (or performing any operation that's closed under regular languages) it wouldn't make it irregular," then continue to pump so that j must be greater than i?

Related

Use the pumping lemma to show that the following languages are not regular languages L = {anbm | n = 2m}

Use the pumping lemma to show that the following languages are not regular languages L = {an bm | n = 2m}
Choose a string a^2p b^p. The pumping lemma says we can write this as w = uvx such that |uv| <= p, |v| < 0 and for all natural numbers n, u(v^n)x is also a string in the language. Because |uv| <= p, the substring uv of w consists entirely of instances of the symbol a. Pumping up or down by choosing a value for n other than one guarantees that the number of a's in the resulting string will change, while the number of b's stays the same. Since the number of a's is twice the number of b's only when n = 1, this is a contradiction. Therefore, the language cannot be regular.
L={anbm|n=2m} Assume that L is regular Language Let the pumping length be p L={a2mbm} Since |s|=3m > m (total string length) take a string s S= aaaaa...aabbb....bbb (a2mbm times taken) Then, u = am-1 ; v= a ; w= ambm. && |uv|<=m Now If i=2 then S= am-1 a2 ambm = a2m-1bm Since here we are getting an extra a in the string S which is no belong to the given language a2mbm our assumption is wrong Therefore it is not a regular Language

what is this shift used in the simplified galil seiferas string match algorithm?

I'm self-studying problem 32-1 in CLRS; part c), presents the following algorithm for string matching:
REPETITION-MATCHER(P, T)
m = P.length
n = T.length
k = 1 + ρ'(P)
q = 0
s = 0
while s <= n-m
if T[s+q+1] == P[q+1]
q = q+1
if q==m
print "Pattern occurs with shift" s
if q==m or T[s+q+1] != P[q+1]
s = s+max(1, ceil(q/k))
q = 0
Here, ρ'(P), which is a function of P only, is defined as the largest integer r such that some prefix P[1..i] = y^r, e.g. a substring y repeated r times.
This algorithm appears to be 95 percent similar to the naive brute-force string matcher. However, the one part which greatly confuses me, and which seems to be the centerpiece of the entire algorithm, is the second to last line. Here, q is the number of characters of P matched so far. What is the rationale behind ceil(q/k)? It is completely opaque to me. It would have made more sense if that line were something like s = s + max(1+q, 1+i), where i is the length of the prefix that gives rise to ρ'(P).
CLRS claims that this algorithm is due to Galil and Seiferas, but in the reference they provide, I cannot find anything that resembles the algorithm provided above. It appears that reference contains, if anything, a much more advanced version of what is here. Can someone explain this ceil(q/k) value, and/or point me toward a reference that describes this particular algorithm, instead of the more well-known main Galil Seiferas paper?
Example #1:
Match aaaa in aaaaab, here ρ' = 4. Consider state:
aaaa ab
^
We have a mismatch here, and we want to move forward by one symbol, no more, because we will match full pattern again (last line sets q to zero). q = 4 and k = 5, so ceil(q/k) = 1, that's all right.
Example #2: Match abcd.abcd.abcd.X in abcd.abcd.abcd.abcd.X. Consider state:
abcd.abcd.abcd. abcd.X
^
We have a mismatch here, and we would like to move forward by five symbols. q = 15 and k = 4, so ceil(q/k) = 4. That's ok, it is almost 5, we still can match our pattern. Had we bigger ρ', say 10, we would have ceil(50/(10+1)) = 5.
Yeh, algorithms skips forward less symbols than KMP does, in case ρ'=10 its running time is O(10n+m) while KMP has O(n+m).
I figured out the proof of correctness.
let k = ρ'(P) + 1, and ρ'(P) is the largest possible repetition factor out of all the prefixes of P.
Suppose T[s+1..s+q] = P[1..q], and either q=m or T[s+q+1] != P[q+1]
Then, for 1 <= j <= floor(q/k) (except for the case q=m and m mod k = 0, in which the upper limit must be ceil(m/k)), we have
T[s+1..s+j] = P[1..j]
T[s+j+1..s+2j] = P[j+1..2j]
T[s+2j+1..s+3j] = P[2j+1..3j]
...
T[s+(k-1)j+1..s+kj] = P[(k-1)j+1..kj]
where not every quantity on every line is equal, since k cannot be a repetition factor, since the largest possible repetition factor out of any prefix of P is k-1.
Suppose we now make a comparison at shift s' = s+j, so that we will make the following comparisons
T[s+j+1..s+2j] with P[1..j]
T[s+2j+1..s+3j] with P[j+1..2j]
T[s+3j+1..s+4j] with P[2j+1..3j]
...
T[s+kj+1..s+(k+1)j] with P[(k-1)j+1..kj]
We claim that not every comparison can match, e.g. at least one of the above "with"s must be replaced with !=. We prove by contradiction. Suppose every "with" above is replaced by =. Then, comparing to the first set of comparisons we did, we would immediately have the following:
P[1..j] = P[j+1..2j]
P[j+1..2j] = [2j+1..3j]
...
P[(k-2)j+1..(k-1)j] = P[(k-1)j+1..kj]
However, this cannot be true, because k is not a repetition factor, hence a contradiction.
Hence, for any 1 <= j <= floor(q/k), testing a new shift s'=s+j is guaranteed to mismatch.
Hence, the smallest shift that is possible to result in a match is s + floor(q/k) + 1 >= ceil(q/k).
Note the code uses ceil(q/k) for simplicity, solely to deal with the case that q = m and m mod k = 0, in which case k * (floor(q/k)+1) would be greater than m, so only ceil(q/k) would do. However, when q mod k = 0 and q < m, then ceil(q/k) = floor(q/k), so is slightly suboptimal, since that shift is guaranteed to fail, and floor(q/k)+1 is the first shift that has any chance of matching.

Why pumping lemma for CFG doesn't work

Language:
{(a^i)(b^j)(c^k)(d^l) : i = 0 or j = k = l}
We take word
w = a^0 b^n c^n d^n
Which obviously belongs to the language because j = k = l
w = uvxyz
|vxy| <= n
|vy| > 1
and now v and y can be:
just a single character and if we pump single character the word is no longer in the language
two characters, count of the third will be lower so the word is not in the language
So, the proof that this language is not CF is not supposed to be do-able with standard pumping lemma, just with the ogdens lemma, but I don't understand why the proof above is invalid.
It doesn't work because in fact every pumped string is in the language, because you still have no as (that is, i=0).
And if you choose a string where i > 0, then you can't guarantee that v isn't just some number of as, and x is the empty string.

algorithm/code in R to find pattern from any position in a string

I want to find the pattern from any position in any given string such that the pattern repeats for a threshold number of times at least.
For example for the string "a0cc0vaaaabaaaabaaaabaa00bvw" the pattern should come out to be "aaaab". Another example: for the string "ff00f0f0f0f0f0f0f0f0000" the pattern should be "0f".
In both cases threshold has been taken as 3 i.e. the pattern should be repeated for at least 3 times.
If someone can suggest an optimized method in R for finding a solution to this problem, please do share with me. Currently I am achieving this by using 3 nested loops, and it's taking a lot of time.
Thanks!
Use regular expressions, which are made for this type of stuff. There may be more optimized ways of doing it, but in terms of easy to write code, it's hard to beat. The data:
vec <- c("a0cc0vaaaabaaaabaaaabaa00bvw","ff00f0f0f0f0f0f0f0f0000")
The function that does the matching:
find_rep_path <- function(vec, reps) {
regexp <- paste0(c("(.+)", rep("\\1", reps - 1L)), collapse="")
match <- regmatches(vec, regexpr(regexp, vec, perl=T))
substr(match, 1, nchar(match) / reps)
}
And some tests:
sapply(vec, find_rep_path, reps=3L)
# a0cc0vaaaabaaaabaaaabaa00bvw ff00f0f0f0f0f0f0f0f0000
# "aaaab" "0f0f"
sapply(vec, find_rep_path, reps=5L)
# $a0cc0vaaaabaaaabaaaabaa00bvw
# character(0)
#
# $ff00f0f0f0f0f0f0f0f0000
# [1] "0f"
Note that with threshold as 3, the actual longest pattern for the second string is 0f0f, not 0f (reverts to 0f at threshold 5). In order to do this, I use back references (\\1), and repeat these as many time as necessary to reach threshold. I need to then substr the result because annoyingly base R doesn't have an easy way to get just the captured sub expressions when using perl compatible regular expressions. There is probably a not too hard way to do this, but the substr approach works well in this example.
Also, as per the discussion in #G. Grothendieck's answer, here is the version with the cap on length of pattern, which is just adding the limit argument and the slight modification of the regexp.
find_rep_path <- function(vec, reps, limit) {
regexp <- paste0(c("(.{1,", limit,"})", rep("\\1", reps - 1L)), collapse="")
match <- regmatches(vec, regexpr(regexp, vec, perl=T))
substr(match, 1, nchar(match) / reps)
}
sapply(vec, find_rep_path, reps=3L, limit=3L)
# a0cc0vaaaabaaaabaaaabaa00bvw ff00f0f0f0f0f0f0f0f0000
# "a" "0f"
find.string finds substring of maximum length subject to (1) substring must be repeated consecutively at least th times and (2) substring length must be no longer than len.
reps <- function(s, n) paste(rep(s, n), collapse = "") # repeat s n times
find.string <- function(string, th = 3, len = floor(nchar(string)/th)) {
for(k in len:1) {
pat <- paste0("(.{", k, "})", reps("\\1", th-1))
r <- regexpr(pat, string, perl = TRUE)
if (attr(r, "capture.length") > 0) break
}
if (r > 0) substring(string, r, r + attr(r, "capture.length")-1) else ""
}
and here are some tests. The last test processes the entire text of James Joyce's Ulysses in 1.4 seconds on my laptop:
> find.string("a0cc0vaaaabaaaabaaaabaa00bvw")
[1] "aaaab"
> find.string("ff00f0f0f0f0f0f0f0f0000")
[1] "0f0f"
>
> joyce <- readLines("http://www.gutenberg.org/files/4300/4300-8.txt")
> joycec <- paste(joyce, collapse = " ")
> system.time(result <- find.string2(joycec, len = 25))
user system elapsed
1.36 0.00 1.39
> result
[1] " Hoopsa boyaboy hoopsa!"
ADDED
Although I developed my answer before having seen BrodieG's, as he points out they are very similar to each other. I have added some features of his to the above to get the solution below and tried the tests again. Unfortunately when I added the variation of his code the James Joyce example no longer works although it does work on the other two examples shown. The problem seems to be in adding the len constraint to the code and may represent a fundamental advantage of the code above (i.e. it can handle such a constraint and such constraints may be essential for very long strings).
find.string2 <- function(string, th = 3, len = floor(nchar(string)/th)) {
pat <- paste0(c("(.", "{1,", len, "})", rep("\\1", th-1)), collapse = "")
r <- regexpr(pat, string, perl = TRUE)
ifelse(r > 0, substring(string, r, r + attr(r, "capture.length")-1), "")
}
> find.string2("a0cc0vaaaabaaaabaaaabaa00bvw")
[1] "aaaab"
> find.string2("ff00f0f0f0f0f0f0f0f0000")
[1] "0f0f"
> system.time(result <- find.string2(joycec, len = 25))
user system elapsed
0 0 0
> result
[1] "w"
REVISED The James Joyce test that was supposed to be testing find.string2 was actually using find.string. This is now fixed.
Not optimized (even it is fast) function , but I think it is more R way to do this.
Get all patterns of certains length > threshold : vectorized using mapply and substr
Get the occurrence of these patterns and extract the one with maximum occurrence : vectorized using str_locate_all.
Repeat 1-2 this for all lengths and tkae the one with maximum occurrence.
Here my code. I am creating 2 functions ( steps 1-2) and step 3:
library(stringr)
ss = "ff00f0f0f0f0f0f0f0f0000"
ss <- "a0cc0vaaaabaaaabaaaabaa00bvw"
find_pattern_length <-
function(length=1,ss){
patt = mapply(function(x,y) substr(ss,x,y),
1:(nchar(ss)-length),
(length+1):nchar(ss))
res = str_locate_all(ss,unique(patt))
ll = unlist(lapply(res,length))
list(patt = patt[which.max(ll)],
rep = max(ll))
}
get_pattern_threshold <-
function(ss,threshold =3 ){
res <-
sapply(seq(threshold,nchar(ss)),find_pattern_length,ss=ss)
res[,which.max(res['rep',])]
}
some tests:
get_pattern_threshold('ff00f0f0f0f0f0f0f0f0000',5)
$patt
[1] "0f0f0"
$rep
[1] 6
> get_pattern_threshold('ff00f0f0f0f0f0f0f0f0000',2)
$patt
[1] "f0"
$rep
[1] 18
Since you want at least three repetitions, there is a nice O(n^2) approach.
For each possible pattern length d cut string into parts of length d. In case of d=5 it would be:
a0cc0
vaaaa
baaaa
baaaa
baa00
bvw
Now look at each pairs of subsequent strings A[k] and A[k+1]. If they are equal then there is a pattern of at least two repetitions. Then go further (k+2, k+3) and so on. Finally you also check if suffix of A[k-1] and prefix of A[k+n] fit (where k+n is the first string that doesn't match).
Repeat it for each d starting from some upper bound (at most n/3).
You have n/3 possible lengths, then n/d strings of length d to check for each d. It should give complexity O(n (n/d) d)= O(n^2).
Maybe not optimal but I found this cutting idea quite neat ;)
For a bounded pattern (i.e not huge) it's best I think to just create all possible substrings first and then count them. This is if the sub-patterns can overlap. If not change the step fun in the loop.
pat="a0cc0vaaaabaaaabaaaabaa00bvw"
len=nchar(pat)
thr=3
reps=floor(len/2)
# all poss strings up to half length of pattern
library(stringr)
pat=str_split(pat, "")[[1]][-1]
str.vec=vector()
for(win in 2:reps)
{
str.vec= c(str.vec, rollapply(data=pat,width=win,FUN=paste0, collapse=""))
}
# the max length string repeated more than 3 times
tbl=table(str.vec)
tbl=tbl[tbl>=3]
tbl[which.max(nchar(names(tbl)))]
aaaabaa
3
NB Whilst I'm lazy and append/grow the str.vec here in a loop, for a larger problem I'm pretty sure the actual length of str.vec is predetermined by the length of the pattern if you care to work it out.
Here is my solution, it's not optimized (build vector with patterns <- c() ; pattern <- c(patterns, x) for example) and can be improve but simpler than yours, I think.
I can't understand which pattern exactly should (I just return the max) be returned but you can adjust the code to what you want exactly.
str <- "a0cc0vaaaabaaaabaaaabaa00bvw"
findPatternMax <- function(str){
nb <- nchar(str):1
length.patt <- rev(nb)
patterns <- c()
for (i in 1:length(nb)){
for (j in 1:nb[i]){
patterns <- c(patterns, substr(str, j, j+(length.patt[i]-1)))
}
}
patt.max <- names(which(table(patterns) == max(table(patterns))))
return(patt.max)
}
findPatternMax(str)
> findPatternMax(str)
[1] "a"
EDIT :
Maybe you want the returned pattern have a min length ?
then you can add a nchar.patt parameter for example :
nchar.patt <- 2 #For a pattern of 2 char min
nb <- nb[length.patt >= nchar.patt]
length.patt <- length.patt[length.patt >= nchar.patt]

Show that the following set over {a,b} is regular

Given the alphabet {a, b} we define Na(w) as the number of occurrences of a in the word w and similarly for Nb(w). Show that the following set over {a, b} is regular.
A = {xy | Na(x) = Nb(y)}
I'm having a hard time figuring out where to start solving this problem. Any information would be greatly appreciated.
Yes it is regular language!
Any string consists if a and b belongs the language A = {xy | Na(x) = Nb(y)}.
Example:
Suppose string is: w = aaaab we can divide this string into prefix x and suffix y
w = a aaab
--- -----
x y
Number of a in x is one, and number of b in in y is also one.
Similarly as string like: abaabaa can be broken as x = ab (Na(x) = 1) and y = aabaa (Nb(y) = 1).
Or w = bbbabbba as x = bbbabb (Na(x) = 1) and y = ba (Nb(y) = 1)
Or w = baabaab as x = baa and y = baab with (Na(x) = 2) and (Nb(y) = 2).
So you can always break a string consist of a and b into prefix x and suffix y such that Na(x) = (Nb(y).
Formal Prrof:
Note: A strings consists of only as or consist of bs doesn't belongs to languagr e.g. aa, a, bbb...
Lets defined new Lagrange CA such that CA = {xy | Na(x) != Nb(y)}. CA stands for complement of A consists of string consists of only as or only bs.
1And CA is a regular language it's regular expression is a+ + b+.
Now as we know CA is a regular language (it can be expression by regular expression and so DFA) and Complement of any regular language is Regular hence language A is also regular language!
To construct DFA for complement language refer: Finding the complement of a DFA? and to write regular expression for DFA refer following two techniques.
How to write regular expression for a DFA
How to write regular expression for a DFA using Arden theorem
'+' Operator in Regular Expression in formal languages
PS: Btw regular expression for A = {xy | Na(x) = Nb(y)} is (a + b)*a(a + b)*b(a + b)*.
First, find out how to prove that a set is regular.
One way is to define a finite state machine that accepts the language.
Second: maybe think about why the set is not regular.
Hint: A = {a, b}*.
Try proving it by induction on length of word, or by finding the shortest word not in A.

Resources