Regular expressions for strings not containing specific substring

What is a regular expression for all words over the alphabet {a, b} that do not contain the substring baa?
Is it:
a* ((aa) * b *)?
Can a string of length 2 be accepted under this condition?

a*(ba?)*
The string may start with arbitrarily many a's, but once a b has appeared, every later a must be isolated: each b is followed by at most one a.

a*(b+(ba))*
Here + denotes union. Once a b has been read, the rest of the string is a sequence of blocks, each either b or ba, so an a that follows a b must either end the string or be followed by another b.
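Both proposed expressions can be checked against the definition by brute force. The following is a minimal sketch, assuming Python's re syntax (so the union a*(b+(ba))* is written a*(b|ba)*); the helper names all_strings and agrees_with_definition are made up for illustration.

```python
import re
from itertools import product

# The two answers; the second is a*(b+(ba))* with union written as |.
PROPOSED = [r"a*(ba?)*", r"a*(b|ba)*"]
# The asker's attempt. The trailing ? does not change the language,
# because the group it applies to can already match the empty string.
ATTEMPT = r"a*((aa)*b*)?"

def all_strings(max_len, alphabet="ab"):
    """Yield every string over the alphabet with length 0..max_len."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            yield "".join(chars)

def agrees_with_definition(pattern, max_len=8):
    """True if the pattern accepts exactly the strings without 'baa'."""
    return all(
        (re.fullmatch(pattern, s) is not None) == ("baa" not in s)
        for s in all_strings(max_len)
    )

if __name__ == "__main__":
    for p in PROPOSED:
        print(p, agrees_with_definition(p))
    print(ATTEMPT, agrees_with_definition(ATTEMPT))
```

Every string of length 2 is accepted, since baa has length 3 and cannot occur in it. The asker's attempt rejects ba, which is one such string, so it does not describe the language.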

Related

Efficient way to check if string A is contained in string B with at most k errors

Given a string A and a string B (A shorter or the same length as B), I would like to check whether B contains a substring A' such that the Hamming distance between A and A' is at most k.
Does anyone know of an efficient algorithm to do this? Obviously I can just run a sliding window, but this is not feasible for the amount of data I'm working with. The Knuth-Morris-Pratt algorithm (https://en.wikipedia.org/wiki/Knuth%E2%80%93Morris%E2%80%93Pratt_algorithm) would work when k=0, but I don't know whether it's modifiable to account for k>0.
Thanks!
Edit: I apparently forgot to clarify: I am looking for a consecutive substring, for example the substring from position 3 to position 7, without skipping characters. So Levenshtein distance is not applicable.

This is what you are looking for: https://en.wikipedia.org/wiki/Levenshtein_distance

If you use the Levenshtein distance and k=1, then you can use the fact that if the length of A is 2n+1 or 2n+2, then either the first or the last n characters of A must be in B.
So you can use strstr to find all places in B where the first or last n characters match exactly and then check the Levenshtein distance.
Special case: if A is a single character, it matches everywhere with one error. Special case: if A is two characters ab, call strchr for a and, if that fails, call strchr for b.
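The same pigeonhole idea can be adapted to the Hamming distance the question asks about: if A is cut into k+1 pieces, at most k of them can contain a mismatch, so at least one piece must occur exactly in B at the right offset. The sketch below is an assumption-laden illustration of that filter, not a method given in the answers; the function name find_with_mismatches is hypothetical, and the worst case is still no better than a sliding window.

```python
def find_with_mismatches(a, b, k):
    """Return sorted start indices i with hamming(a, b[i:i+len(a)]) <= k."""
    n = len(a)
    if n == 0 or n > len(b):
        return []
    if k >= n:
        # Every window matches when all positions may differ.
        return list(range(len(b) - n + 1))
    pieces = k + 1
    # Cut a into k+1 non-empty pieces (possible because k < n).
    bounds = [n * p // pieces for p in range(pieces + 1)]
    candidates = set()
    for p in range(pieces):
        lo, hi = bounds[p], bounds[p + 1]
        piece = a[lo:hi]
        start = b.find(piece)
        while start != -1:
            i = start - lo  # where the whole of a would have to begin
            if 0 <= i <= len(b) - n:
                candidates.add(i)
            start = b.find(piece, start + 1)
    # Verify each candidate window with a direct Hamming count.
    return sorted(
        i for i in candidates
        if sum(x != y for x, y in zip(a, b[i:i + n])) <= k
    )

if __name__ == "__main__":
    print(find_with_mismatches("abc", "xxabdxxabc", 1))
```

The exact search is delegated to str.find here; any exact matcher such as Knuth-Morris-Pratt could be substituted.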

How do you interpret it? (u∈Σ∗)

Here is the full rule
{a^k u a^k| k≥1, u∈Σ∗}
Does this mean that u can be replaced by a single a, a single b, or any combination of a's and b's from the alphabet?
So if k=1, is it aaa | aba or a(aba)a | a(ba)a?
Thanks
Rahman

This rule means every string in the language has the same number of a's at the beginning as at the end, with whatever you want (including more a's) between.
So aaa, aba, aabaa and abaa are all in the language (assuming b is in Σ).
In fact, it is enough that the string is at least 2 characters long and has an a at each end (left as an exercise).
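The claim can be checked by brute force for short strings. This is a small sketch, assuming Σ = {a, b}; both function names are invented for illustration.

```python
from itertools import product

def in_language_by_definition(s):
    """True if s = a^k u a^k for some k >= 1 and some (possibly empty) u."""
    # The prefix a^k and suffix a^k must not overlap, so 2k <= len(s).
    return any(
        s.startswith("a" * k) and s.endswith("a" * k)
        for k in range(1, len(s) // 2 + 1)
    )

def in_language_simple(s):
    """The shortcut test: length at least 2 and an a at each end."""
    return len(s) >= 2 and s[0] == "a" and s[-1] == "a"

if __name__ == "__main__":
    for s in ["aaa", "aba", "aabaa", "abaa", "a", "ab"]:
        print(s, in_language_by_definition(s), in_language_simple(s))
```

The two tests agree because k=1 is always a valid choice whenever the string starts and ends with a and has length at least 2.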

How to find all strings that do not contain substring palindromes

Disclaimer: This is a problem lifted from HackerRank, but their editorial answer wasn't sufficient so I hoped to get better answers. If it's against any policy, please let me know and I'll take this down.
Problem:
You are given two integers, N and M. Count the number of strings of length N over an alphabet of size M that do not contain any palindromic string of length greater than 1 as a consecutive substring.
N=2, M=2 -> 2 :: AB, BA (AA and BB are excluded)
N=2, M=3 -> 6 :: AB, AC, BA, BC, CA, CB (AA, BB and CC are excluded)
ABCDE counts as it does not contain any palindromic substrings.
ABCCC does not count as it does contain "CCC", a palindrome of length >1.
Editorial
Here is the provided answer which I think is wrong:
For N>=3, there are (M−2) ways to choose each next symbol (after the first two): it must differ from the previous symbol and from the one before that, and those two are themselves distinct.
If N=1, return M
If N=2, return M * (M-1)
If N>=3, return M * (M-1) * (M-2)^(N-2)
counterexample: N=4, M=3, "ABCC"
My Solution Try
When I was working on this problem, I tried to count all the strings that contain palindromic substrings and subtract that from the total, M^N. I ran into a lot of problems with overcounting. For example, "ABABA" has "ABA", "BAB", "ABA" of n=3, and "ABABA" of n=5.
Thanks for any help in elucidating this problem. I really hope for a good answer to figure this out!

Suppose you build up palindrome-free strings one letter at a time. For the first letter, you have M choices, and for the second, you have M-1, since you can't use the first letter. This much is obvious.
For every letter after the first two, you can't use the previous letter (that would create a palindrome of length 2), and you can't use the letter before that (that would create a palindrome of length 3), so that's two choices eliminated. What about the other letters? If using one of those created a palindrome, it would have to be a palindrome of length at least 4. But stripping the first and last letters of a palindrome of length K+2 leaves a palindrome of length K, so for K>=2 the string would already have contained a palindrome before the new letter was added. (For K<2 the inner part is too short to count as a palindrome, which is why lengths 2 and 3 have to be ruled out directly.) Since the string didn't have any palindromes of length >=2, we can conclude that adding any letter other than the previous two letters is fine.
Thus, we have M choices for the first letter, M-1 choices for the second, and M-2 for every letter after that.
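The resulting formula can be compared with a direct enumeration for small N and M. This is a minimal sketch; the names formula, has_palindrome and brute_force are made up here, and the brute force checks every substring rather than relying on the argument above.

```python
from itertools import product

def formula(n, m):
    """M choices, then M-1, then M-2 for every later letter."""
    if n == 1:
        return m
    return m * (m - 1) * (m - 2) ** (n - 2)

def has_palindrome(s):
    """True if s contains a palindromic substring of length >= 2."""
    return any(
        s[i:j] == s[i:j][::-1]
        for i in range(len(s))
        for j in range(i + 2, len(s) + 1)
    )

def brute_force(n, m):
    """Count palindrome-free strings of length n over m symbols."""
    alphabet = range(m)
    return sum(1 for s in product(alphabet, repeat=n) if not has_palindrome(s))

if __name__ == "__main__":
    for m in range(2, 5):
        for n in range(1, 6):
            print(n, m, formula(n, m), brute_force(n, m))
```

Note that "ABCC" from the question is rejected by has_palindrome because it contains "CC", so it is not a counterexample to the formula.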

Deterministic automata to find number of subsequence in string of another string

Deterministic automaton to count subsequences in a string?
How can I construct a DFA to count the occurrences of one string as a subsequence of another string?
E.g. in "ssstttrrriiinnngggg" we have 3 subsequences which form the string "string".
Both the string to be found and the string to be searched contain only characters from a specific character set.
I have some idea about storing characters in a stack and popping them until we match, pushing them again if they don't match.
Is there a DFA solution?

OVERLAPPING MATCHES
If you wish to count the number of overlapping sequences then you simply construct a DFA that matches the string, e.g.
1 -(if see s)-> 2 -(if see t)-> 3 -(if see r)-> 4 -(if see i)-> 5 -(if see n)-> 6 -(if see g)-> 7
and then compute the number of ways of being in each state after seeing each character using dynamic programming. See the answers to this question for more details.
DP[a][b] = number of ways of being in state b after seeing the first a characters
= DP[a-1][b] + DP[a-1][b-1] if character at position a is the one needed to take state b-1 to b
= DP[a-1][b] otherwise
Start with DP[0][b]=0 for b>1 and DP[0][1]=1.
Then the total number of overlapping strings is DP[len(string)][7]
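The recurrence above can be sketched in a few lines. This version keeps only one row of the DP table and numbers states from 0 instead of 1, so state 0 is the start and state len(pattern) is the accepting state; the function name count_subsequences is an assumption for illustration.

```python
def count_subsequences(text, pattern):
    """Count the ways pattern occurs as a subsequence of text (overlaps allowed)."""
    m = len(pattern)
    # dp[j] = number of ways of being in state j after the characters seen so far.
    dp = [0] * (m + 1)
    dp[0] = 1
    for ch in text:
        # Go from high states to low so each character advances a partial match once.
        for j in range(m, 0, -1):
            if pattern[j - 1] == ch:
                dp[j] += dp[j - 1]
    return dp[m]

if __name__ == "__main__":
    print(count_subsequences("ssstttrrriiinnngggg", "string"))
```

For the example string this counts every choice of one s, one t, one r, one i, one n and one g, i.e. 3*3*3*3*3*4 = 972 overlapping subsequences.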
NON-OVERLAPPING MATCHES
If you are counting the number of non-overlapping sequences, then if we assume that the characters in the pattern to be matched are distinct, we can use a slight modification:
DP[a][b] = number of strings being in state b after seeing the first a characters
= DP[a-1][b] + 1 if character at position a is the one needed to take state b-1 to b and DP[a-1][b-1]>0
= DP[a-1][b] - 1 if character at position a is the one needed to take state b to b+1 and DP[a-1][b]>0
= DP[a-1][b] otherwise
Start with DP[0][b]=0 for b>1 and DP[0][1]=infinity.
Then the total number of non-overlapping strings is DP[len(string)][7]
This approach will not necessarily give the correct answer if the pattern to be matched contains repeated characters (e.g. 'strings').
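A sketch of the non-overlapping variant, under the same assumption that all pattern characters are distinct. States are numbered from 0 here, waiting[j] plays the role of DP[a][b] for the current prefix, and the name count_disjoint is hypothetical.

```python
import math

def count_disjoint(text, pattern):
    """Count non-overlapping subsequence matches; assumes distinct pattern characters."""
    if len(set(pattern)) != len(pattern):
        raise ValueError("pattern characters must be distinct")
    position = {ch: j for j, ch in enumerate(pattern)}
    # waiting[j] = partial matches that have consumed the first j pattern characters.
    waiting = [0] * (len(pattern) + 1)
    waiting[0] = math.inf  # unlimited supply of fresh matches in the start state
    for ch in text:
        j = position.get(ch)
        if j is not None and waiting[j] > 0:
            waiting[j] -= 1      # one partial match leaves state j ...
            waiting[j + 1] += 1  # ... and enters state j+1
    return waiting[len(pattern)]

if __name__ == "__main__":
    print(count_disjoint("ssstttrrriiinnngggg", "string"))
```

On the example from the question this gives 3: the fourth g finds no partial match waiting for it.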

Algorithm to form a given pattern using some strings

Given are 6 strings of any length. The words are to be arranged in the pattern shown below. They can be arranged either vertically or horizontally.
--------
|      |
|      |
|      |
---------------
       |      |
       |      |
       |      |
       --------
The pattern need not be symmetric, and there must be two empty areas as shown.
For example:
Given strings
PQF
DCC
ACTF
CKTYCA
PGYVQP
DWTP
The pattern can be
DCC...
W.K...
T.T...
PGYVQP
..C..Q
..ACTF
where dots represent empty areas.
The other example is
RVE
LAPAHFUIK
BIRRE
KZGLPFQR
LLHU
UUZZSQHILWB
Pattern is
LLHU....
A..U....
P..Z....
A..Z....
H..S....
F..Q....
U..H....
I..I....
KZGLPFQR
...W...V
...BIRRE
If multiple patterns are possible, then the pattern with the lexicographically smallest first line, then second line, and so on is to be formed. What algorithm can be used to solve this?

Find strings that satisfy these constraints:
strlen(a) + strlen(b) - 1 = strlen(c)
strlen(d) + strlen(e) - 1 = strlen(f)
After that, try every possible arrangement and check whether it is valid. For example:
aaa.....
d.f.....
d.f.....
d.f.....
cccccccc
..f....e
..f....e
..bbbbbb
There will be 2*2*2 = 8 different arrangements.

There are a number of heuristics that you can apply, but before that, let's go over some properties of the puzzle.
+aa+
c  f
+ee+eee+
   f   d
   +bbb+
Let each letter denote the length of the string it labels in the diagram above. We have:
a + b - 1 = e
c + d - 1 = f
I will refer to the 2 strings for the cross in the middle as middle strings.
We also know that no string can be shorter than 2 characters. Therefore, we can infer:
e > a, e > b
f > c, f > d
From this, we know that the 2 shortest strings cannot be middle strings, due to the inequalities above.
The 3 longest strings also cannot all have the same length: only two of them can be middle strings, so the third would be an outer string as long as a middle string, which the inequalities above forbid.
The puzzle is only tricky when the lengths are regular. When the lengths are irregular, you can map directly from length to position.
If the 2 longest strings have equal length, then by the inequalities above they are the 2 middle strings. The worst case for this one is a "regular" puzzle, where the lengths a, b, c, d are equal.
If the 2 longest strings have unequal lengths, the longest string's position can be determined immediately (since its length is unique in the puzzle): it is one of the middle strings. In the worst case, there can be 3 candidates for the other middle string; just brute force and check all of them.
Algorithm:
Try to map each string of unique length to its position.
Brute force the 2 strings in the middle (taking into consideration what I mentioned above), and brute force to fill in the rest.
Even with naive brute force, there are only 6! = 720 cases if the strings can only go from left to right and top to bottom (no reversal). There will be 46080 cases (* 2^6) if the strings are allowed to go in any direction.
