How can I remove the last symbol of a DFA - regular-language

How can I construct a DFA that accepts every string accepted by another DFA, but with the last symbol removed?
For example, if the given machine accepts the string abb, how do I construct a DFA that accepts the string ab?

Related

return only chars from the string in python

I am looking to extract only the letters from a given string, but my code is doing exactly the opposite.
s= "A man, a plan, a canal: Panama"
newS = ''.join(re.findall("[^a-zA-Z]*", s))
print(newS)  # my output: , , :
Expected output string:
"A man a plan a canal Panama"
Your regular expression is inverting the match - that's what the caret symbol (^) does inside square brackets (negated character class). You first need to remove that.
Next, you should be matching a sequence of one or more characters (+) rather than zero or more characters (*) -- using * will match the empty string, which you don't want in this case.
Finally, your join should join with a space rather than an empty string to get the intended output -- an empty string won't retain the spaces between the words.
newS = ' '.join(re.findall(r'[a-zA-Z]+', s))
Though not essential in this case, it's advised to use raw strings for regular expressions (r). More in this post.
Full working code:
import re
s = 'A man, a plan, a canal: Panama'
newS = ' '.join(re.findall(r'[a-zA-Z]+', s))
print(newS)

Find the minimal lexicographical string formed by merging two strings

Suppose we are given two strings s1 and s2 (both lowercase). We have to find the lexicographically minimal string that can be formed by merging the two strings.
At first it looks pretty simple, like the merge step of the mergesort algorithm. But let us see what can go wrong.
s1: zyy
s2: zy
If we run the merge on these two, we must decide which z to pick, as they are equal. If we pick the z of s2 first, the string formed will be:
zyzyy
If we pick z of s1 first, the string formed will be:
zyyzy which is correct.
As we can see, the merge step of mergesort can lead to a wrong answer.
Here's another example:
s1:zyy
s2:zyb
Now the correct answer is zybzyy, which is obtained only if we pick the z of s2 first.
There are plenty of other cases in which the simple merge will fail. My question is: is there a standard algorithm for performing a merge that produces such output?
You could use dynamic programming. In f[x][y], store the lexicographically minimal string obtained by taking x characters from the first string s1 and y characters from the second string s2. You can calculate f bottom-up using the update:
f[x][y] = min(f[x-1][y] + s1[x], f[x][y-1] + s2[y])   // the '+' here represents the concatenation of a string and a character
You start with f[0][0] = "" (empty string).
For efficiency you can store the strings in f as references. That is, you can store in f the objects
class StringRef {
    StringRef prev;
    char c;
}
To extract the string at a given f[x][y], you just follow the references. To update, you point back to either f[x-1][y] or f[x][y-1], depending on what your update step says.
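The recurrence above can be sketched in Python roughly as follows. This is only an illustration, not the answerer's code: the name min_merge_dp is made up here, the indices are shifted to 0-based, and the table stores whole strings instead of back-references, which keeps the sketch short at the cost of extra memory.

```python
def min_merge_dp(s1: str, s2: str) -> str:
    """Lexicographically smallest merge of s1 and s2 via a DP table."""
    n, m = len(s1), len(s2)
    # f[x][y] = smallest string built from the first x chars of s1
    # and the first y chars of s2 (hypothetical layout for this sketch).
    f = [[""] * (m + 1) for _ in range(n + 1)]
    for x in range(n + 1):
        for y in range(m + 1):
            if x == 0 and y == 0:
                continue  # f[0][0] stays the empty string
            candidates = []
            if x > 0:
                candidates.append(f[x - 1][y] + s1[x - 1])
            if y > 0:
                candidates.append(f[x][y - 1] + s2[y - 1])
            f[x][y] = min(candidates)
    return f[n][m]


print(min_merge_dp("zyy", "zy"))   # zyyzy
print(min_merge_dp("zyy", "zyb"))  # zybzyy
```

Both candidates for a cell have the same length, so taking their minimum is safe. Storing full strings makes this sketch cubic in the worst case.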
It seems that the solution can be almost the same as the one you described (the mergesort-like approach), except with special handling of equality. As long as the first characters of both strings are equal, you look ahead at the second character, the third, and so on. If the end of one string is reached, treat the first character of the other string as the next character of the exhausted string, then its second character, and so on. If the ends of both strings are reached, it doesn't matter which string you take the first character from. Note that this algorithm is O(N) because after a look-ahead over equal prefixes you know the whole look-ahead sequence (i.e. the string prefix) to include, not just the first character.
EDIT: you look ahead as long as the current i-th characters of both strings are equal and alphabetically not larger than the first character of the current prefix.
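A simpler variant of this tie-breaking idea is sketched below. It is not the O(N) look-ahead described above: it compares the whole remaining suffixes at every step, which can be quadratic in the worst case. It assumes lowercase a-z input and uses '{' (the ASCII character right after 'z') as an end marker. The name min_merge_greedy is hypothetical.

```python
def min_merge_greedy(s1: str, s2: str) -> str:
    """Greedy merge that breaks ties by comparing the remaining suffixes."""
    # '{' sorts after every lowercase letter, so an exhausted string
    # never wins a comparison against one that still has letters left.
    a, b = s1 + "{", s2 + "{"
    i = j = 0
    out = []
    while i < len(s1) or j < len(s2):
        if a[i:] <= b[j:]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    return "".join(out)


print(min_merge_greedy("zyy", "zy"))   # zyyzy
print(min_merge_greedy("zyy", "zyb"))  # zybzyy
```

The end marker is what makes zyy win against zy in the first example: after the common prefix zy, 'y' is compared with '{' and is smaller.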

algorithm for finding all substrings from a specific alphabet in a string in O(m+n) time

Given a string S, find all maximal substrings that contain only characters from alphabet A in O(|S|+|A|) time. A "maximal substring" is a substring of S surrounded by characters that are not in alphabet A, or by string boundaries.
example:
S = rerwmkwerewkekbvverqwewevbvrewqwmkwe
A = {w,r,e}
answer: rerw, werew, e, er, wewe, rew, w, we
Can you help?
Here is one way to map your input to the output you've provided.
Take the characters of the string one at a time and check each against the characters in A.
Use a boolean hash table with 26 entries, one per letter of the alphabet.
Note: if capital letters are included too, map them to their lowercase counterparts for case-insensitive matching, or double the hash table size for case-sensitive matching.
If the character matches, append it to the current substring and move on.
If there is a miss, end the current substring, save it, and start a fresh one at the next match.
Without the hash table it would take O(m*n) time, but with it building the table takes O(m) and traversing the string takes O(n), for O(m+n) time in total.
Similar to what others have suggested, but in pseudocode form:
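A = boolean array indexed by character, initially all false
for each c in the alphabet
    set A[c] = true
L = stack of strings containing your solution
push empty string onto stack L
for each character c of S
    if A[c] is true
        append c to the top string of stack L
    else if the top string of stack L is not empty
        push empty string onto stack L
if the top string of stack L is empty
    pop it from stack L
return L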
Creating A takes O(n) and iterating through S takes O(m), where n is the size of the alphabet and m is the length of S.
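A runnable Python sketch of the same scan is shown below. It uses a set in place of the boolean array, on the assumption that set membership checks take expected constant time. The function name maximal_substrings is made up for illustration.

```python
def maximal_substrings(s: str, alphabet) -> list:
    """Return the maximal runs of s made only of characters in alphabet."""
    allowed = set(alphabet)      # built once, O(|A|)
    result = []
    current = []                 # characters of the run being built
    for c in s:                  # single pass over S, O(|S|)
        if c in allowed:
            current.append(c)
        elif current:            # a run just ended
            result.append("".join(current))
            current = []
    if current:                  # run that reaches the end of the string
        result.append("".join(current))
    return result


print(maximal_substrings("rerwmkwerewkekbvverqwewevbvrewqwmkwe", "wre"))
# ['rerw', 'werew', 'e', 'er', 'wewe', 'rew', 'w', 'we']
```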

Burrows-Wheeler Transform without EOF character

I need to perform the well-known Burrows-Wheeler Transform in linear time. I found a solution that uses suffix sorting and an EOF character, but appending EOF changes the transformation. For example, consider the string bcababa and two of its rotations:
s1 = abababc
s2 = ababcab
it's clear that s1 < s2. Now with an EOF character:
s1 = ababa#bc
s2 = aba#bcab
and now s2 < s1. And the resulting transformation will be different. How can I perform BWT without EOF?
You can perform the transform in linear time and space without the EOF character by computing the suffix array of the string concatenated with itself. Then iterate over the suffix array. If the current suffix array value is less than n (the length of the original string), append to your output the last character of the rotation starting at that position. This approach will produce a slightly different result from the BWT with an EOF character, however, since the string rotations aren't sorted as if the EOF character were present.
A more thorough description can be found here: http://www.quora.com/Algorithms/How-I-can-optimize-burrows-wheeler-transform-and-inverse-transform-to-work-in-O-n-time-O-n-space
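The doubled-string idea can be sketched in Python as follows. Note the assumption: the suffix array here is built by naive sorting, so this version is not linear time. A linear-time suffix array construction would have to be swapped in to reach O(n). The name bwt_no_eof is hypothetical.

```python
def bwt_no_eof(s: str) -> str:
    """BWT of s without an EOF marker, via the suffix array of s + s."""
    n = len(s)
    doubled = s + s
    # Naive suffix array: sort start positions by the suffix they begin.
    suffix_array = sorted(range(2 * n), key=lambda i: doubled[i:])
    # Keep only suffixes that start inside the first copy; each one
    # begins with the rotation of s starting at i, whose last
    # character is s[i - 1] (wrapping around for i == 0).
    return "".join(s[(i - 1) % n] for i in suffix_array if i < n)


print(bwt_no_eof("banana"))  # nnbaaa
```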
You need an EOF character in the string for BWT to work, because otherwise you can't perform the inverse transform to get the original string back. Without EOF, both strings "ba" and "ab" have the same transformed version ("ba"). With EOF, the transforms are different:
ab         ba
a b |      a | b
b | a      b a |
| a b      | b a
i.e. ab transforms to "|ab" and ba to "b|a".
EOF is needed for BWT because it marks the point where the character cycle starts.
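The "ab" versus "ba" example can be checked with a small rotation-sorting sketch. It assumes '|' as the EOF marker, as in the table above, and relies on Python ordering '|' after the lowercase letters. The helper name bwt_by_rotations is made up.

```python
def bwt_by_rotations(s: str) -> str:
    """Naive BWT: sort all rotations of s and read off the last column."""
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)


# Without EOF the two inputs collide.
print(bwt_by_rotations("ab"), bwt_by_rotations("ba"))    # ba ba
# With '|' appended as EOF they differ.
print(bwt_by_rotations("ab|"), bwt_by_rotations("ba|"))  # |ab b|a
```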
Re: doing it without the EOF character, according to Wikipedia,
Since any rotation of the input string will lead to the same
transformed string, the BWT cannot be inverted without adding an 'EOF'
marker to the input or, augmenting the output with information, such
as an index, that makes it possible to identify the input string from
the class of all of its rotations.
There is a bijective version of the transform, by which the
transformed string uniquely identifies the original. In this version,
every string has a unique inverse of the same length.
The bijective transform is computed by first factoring the input into
a non-increasing sequence of Lyndon words; such a factorization exists
by the Chen–Fox–Lyndon theorem, and can be found in linear time.
Then, the algorithm sorts together all the rotations of all of these
words; as in the usual Burrows–Wheeler transform, this produces a
sorted sequence of n strings. The transformed string is then obtained
by picking the final character of each of these strings in this sorted
list.
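For the factorization step only, one standard linear-time method is Duval's algorithm. A sketch is given below; the quoted text does not name an algorithm, so treat the choice as an assumption. The rotation-sorting step of the bijective transform is not covered here.

```python
def lyndon_factorization(s: str) -> list:
    """Factor s into a non-increasing sequence of Lyndon words (Duval)."""
    n = len(s)
    factors = []
    i = 0
    while i < n:
        j, k = i + 1, i
        # Extend while s[i:j] is still a prefix of a power of a Lyndon word.
        while j < n and s[k] <= s[j]:
            if s[k] < s[j]:
                k = i          # the candidate word grows to s[i:j+1]
            else:
                k += 1         # still repeating the current word
            j += 1
        # Emit every full copy of the Lyndon word of length j - k.
        while i <= k:
            factors.append(s[i:i + j - k])
            i += j - k
    return factors


print(lyndon_factorization("banana"))  # ['b', 'an', 'an', 'a']
```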
I know this thread is quite old but I had the same problem and came up with the following solution:
Find the lexicographically minimal string rotation and save the offset (needed to reverse) (I use the Lyndon factorization)
Use the normal BWT algorithms on the rotated string (this produces the right output because all algorithms assume that the string is followed by the lexicographically minimal char)
To reverse: un-BWT using e.g. backward search starting at index 0 and write the corresponding char to the saved offset
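These three steps can be sketched in Python as below, under several simplifying assumptions. The minimal rotation is found by brute force rather than by Lyndon factorization, the transform sorts rotations naively, and the inverse rebuilds the rotated string first and then undoes the rotation instead of writing characters at the saved offset directly. All function names are hypothetical.

```python
def min_rotation_offset(s: str) -> int:
    """Start index of the lexicographically smallest rotation (brute force)."""
    return min(range(len(s)), key=lambda i: s[i:] + s[:i])


def bwt_rotated(s: str):
    """Return (BWT of the minimal rotation of s, saved offset)."""
    if not s:
        return "", 0
    k = min_rotation_offset(s)
    r = s[k:] + s[:k]
    rows = sorted(r[i:] + r[:i] for i in range(len(r)))
    return "".join(row[-1] for row in rows), k


def inverse_bwt_rotated(last: str, k: int) -> str:
    """Invert bwt_rotated: walk the LF mapping from row 0, then un-rotate."""
    n = len(last)
    # A stable sort of positions by character yields the first column;
    # order[f] is the position in `last` matching row f of that column.
    order = sorted(range(n), key=lambda i: last[i])
    lf = [0] * n
    for f, l in enumerate(order):
        lf[l] = f
    out = []
    row = 0                      # row 0 holds the minimal rotation itself
    for _ in range(n):
        out.append(last[row])    # characters come out back to front
        row = lf[row]
    r = "".join(reversed(out))
    return r[n - k:] + r[:n - k]  # undo the rotation by the saved offset


transformed, offset = bwt_rotated("banana")
print(transformed, offset)                       # nnbaaa 5
print(inverse_bwt_rotated(transformed, offset))  # banana
```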

Representing the strings we use in programming in math notation

Now I'm a programmer who's recently discovered how bad he is when it comes to mathematics and decided to focus a bit on it from that point forward, so I apologize if my question insults your intelligence.
In mathematics, is there the concept of strings that is used in programming? i.e. a permutation of characters.
As an example, say I wanted to translate the following into mathematical notation:
let s be a string of n number of characters.
The reason is that I want to use that representation to find other things about the string s, such as its length: len(s).
How do you formally represent such a thing in mathematics?
Talking more practically, so to speak, let's say I wanted to mathematically explain such a function:
fitness(s,n) = 1 / |n - len(s)|
Or written in more "programming-friendly" sort of way:
fitness(s,n) = 1 / abs(n - len(s))
I used this function to explain how a fitness function for a given GA works; the question was about finding strings with 5 characters, and I needed the solutions to be sorted in ascending order according to their fitness score, given by the above function.
So my question is, how do you represent the above pseudo-code in mathematical notation?
You can use the notation of language theory, which is used to discuss things like regular languages, context free grammars, compiler theory, etc. A quick overview:
A set of characters is known as an alphabet. You could write: "Let A be the ASCII alphabet, a set containing the 128 ASCII characters."
A string is a sequence of characters. ε is the empty string.
A set of strings is formally known as a language. A common statement is, "Let s ∈ L be a string in language L."
Concatenating alphabets produces sets of strings (languages). A represents all 1-character strings; AA, also written A², is the set of all two-character strings. A⁰ is the set of all zero-length strings and is precisely A⁰ = {ε}. (It contains exactly one string, the empty string.)
A* is special notation and represents the set of all strings over the alphabet A, of any length. That is, A* = A⁰ ∪ A¹ ∪ A² ∪ A³ ∪ ... . You may recognize this notation from regular expressions.
For length use absolute value bars. The length of a string s is |s|.
So for your statement:
let s be a string of n number of characters.
You could write:
Let A be a set of characters and s ∈ Aⁿ be a string of n characters. The length of s is |s| = n.
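With that notation, the fitness function from the question could be written along these lines. This is one possible rendering, not the only one. The outer bars are absolute value and the inner bars are string length. The side condition that the length differs from n is an assumption added here, since the quotient is undefined otherwise.

```latex
\text{Let } A \text{ be an alphabet and } s \in A^{*}. \qquad
\operatorname{fitness}(s, n) = \frac{1}{\bigl|\, n - |s| \,\bigr|},
\quad n \in \mathbb{N},\ |s| \neq n
```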
Mathematically, you have explained fitness(s, n) just fine as long as len(s) is well-defined.
In CS texts, a string s over a set S is defined as a finite ordered list of elements of S and its length is often written as |s| - but this is only notation, and doesn't change the (mathematical) meaning behind your definition of fitness, which is pretty clear just how you've written it.
