The most effective string replace algorithm?

We know that most code editors implement string search with the Boyer-Moore algorithm. How do they implement the string replace algorithm? Any ideas?

I'm guessing that nowadays most text editors use either a single block of memory to hold the entire file, or an array of lines or larger blocks, each of which points to its own block of memory. (In the past there have been more interesting techniques employed. One way, the classic gap buffer, is to have all text to the left of or above the cursor position "pressed against" the left end of a fixed-size buffer, and all text to the right of or below it "pressed against" the right end, with a gap of free space in the middle. Then the common operations of inserting or deleting characters can take place in constant time! Moving the cursor k positions to the right entails sliding k bytes from the left end of the right segment to the right end of the left segment, i.e. moving the cursor is now a linear-time operation!)
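To make that parenthetical concrete, here is a minimal gap-buffer sketch (Python; the class and method names are my own, and a real editor would also grow the gap when it fills):

# Minimal gap-buffer sketch: text left of the cursor sits at the left end,
# text right of it at the right end, with a gap of free space in between.
class GapBuffer:
    def __init__(self, capacity=1024):
        self.buf = [''] * capacity
        self.gap_start = 0           # cursor position / end of left segment
        self.gap_end = capacity      # start of right segment

    def insert(self, ch):            # O(1): fill one slot of the gap
        self.buf[self.gap_start] = ch
        self.gap_start += 1

    def delete_before_cursor(self):  # O(1): widen the gap leftwards
        if self.gap_start > 0:
            self.gap_start -= 1

    def move_right(self, k):         # O(k): slide k chars across the gap
        for _ in range(min(k, len(self.buf) - self.gap_end)):
            self.buf[self.gap_start] = self.buf[self.gap_end]
            self.gap_start += 1
            self.gap_end += 1

    def text(self):
        return ''.join(self.buf[:self.gap_start]) + ''.join(self.buf[self.gap_end:])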
Assuming that the text is stored in an "ordinary" way (i.e. not the left-right cursor-dependent buffer pair described above), there aren't too many ways to optimise replace operations, especially if the replacement text is longer than the search text -- in this case, there is no escaping the fact that the rest of the line/block/file must be shunted forward in memory for each replacement. The best you can do there is to avoid multiple O(n) copy operations when one will do -- i.e. don't delete the search string, then insert the replacement string one character at a time, shunting the rest of the line/block/document forward one character at a time, because the latter step will cost O(n^2) time. Instead, shunt the rest of the document text far enough forward to make room for the replacement string in one O(n) step.
If the replacement string is shorter than the search text, you can scan forward with two pointers, from and to, copying each character from the from position to the to position. As replacements are made, to will start to lag behind from. This is safe because to <= from always holds, so you will never write over something you still have to read later.
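A sketch of that forward scan (Python over a character list; the naive substring test stands in for whatever matcher the editor actually uses):

# In-place replace when len(repl) <= len(search): `to` never passes `frm`,
# so a write can never clobber a character that still has to be read.
def replace_shorter(text, search, repl):
    assert len(repl) <= len(search)
    buf = list(text)
    to = frm = 0
    while frm < len(buf):
        if ''.join(buf[frm:frm + len(search)]) == search:
            for ch in repl:
                buf[to] = ch
                to += 1
            frm += len(search)
        else:
            buf[to] = buf[frm]
            to += 1
            frm += 1
    return ''.join(buf[:to])

print(replace_shorter("abcabcabc", "abcabc", "xyz"))  # xyzabc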
Actually, if the replacement string is longer than the search string, and no proper suffix of the search string is also a prefix of the search string, then you can safely scan backwards from the end in one O(n) pass (after growing the buffer to its final size). The suffix/prefix requirement is necessary to avoid situations like the following, which would produce different behaviour depending on the scan direction:
Search and replace "abcabc" with "xyz" in document text "abcabcabc":
S&R using forward algo gives: xyzabc
S&R using backward algo gives: abcxyz
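When the no-self-overlap condition does hold, the backward pass might look like this (Python sketch; the up-front counting pass is one way to size the buffer, and the names are mine):

# Backward replace when len(repl) >= len(search): grow the buffer once,
# then copy right to left; `to` stays >= `frm`, so reads are safe.
# Under the no-self-overlap condition, forward and backward scans find
# the same set of matches, so counting forward is safe here.
def replace_longer(text, search, repl):
    assert len(repl) >= len(search)
    grow = text.count(search) * (len(repl) - len(search))
    buf = list(text) + [''] * grow
    frm = len(text) - 1               # read pointer (old end)
    to = len(buf) - 1                 # write pointer (new end)
    while frm >= 0:
        if frm + 1 >= len(search) and text[frm + 1 - len(search):frm + 1] == search:
            for ch in reversed(repl):
                buf[to] = ch
                to -= 1
            frm -= len(search)
        else:
            buf[to] = buf[frm]
            to -= 1
            frm -= 1
    return ''.join(buf)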

Important algorithm involving random access to a string?

I am implementing a different string representation where accessing a string in a non-sequential manner is very costly. To avoid this, I try to implement certain position caches or character blocks so one can jump to certain locations and scan from there.
In order to do so, I need a list of algorithms where scanning a string from right to left or random access to its characters is required, so I have a set of test cases to do some actual benchmarking and to create a model I can use to find a local/global optimum for my efforts.
Basically I know of:
String.charAt
String.lastIndexOf
String.endsWith
One scenario where one needs right-to-left access to strings is extracting the file extension and the file name (item) from paths.
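For example (a hypothetical helper; Python's rfind scans from the right):

# Split "dir/archive.tar.gz" into the item name and its extension by
# scanning right to left for the last separator and the last dot.
def name_and_ext(path):
    item = path[path.rfind('/') + 1:]
    dot = item.rfind('.')
    return (item, '') if dot <= 0 else (item[:dot], item[dot + 1:])

print(name_and_ext("dir/archive.tar.gz"))  # ('archive.tar', 'gz')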
For random access I find no algorithms at all, unless one builds prefix tables and accesses the string more randomly, checking all those positions for strings longer than the prefix.
Does anyone know of other algorithms where either right-to-left or random access to string characters is required?
[Update]
The hash code of a String is computed using every character, accessed from left to right, while the running value is kept in a local primitive variable. So this is not a candidate for random access.
The MD5 and CRC algorithms likewise process the complete string. So I find no random-access examples at all.
One interesting algorithm is Boyer-Moore searching, which involves both skipping forward by a variable number of characters and comparing backwards. If those two operations are not O(1), then KMP searching becomes more attractive, but BM searching is much faster for long search patterns (except in rare cases where the search pattern contains lots of repetitions of its own prefix). For example, BM shines for patterns which must be matched at word-boundaries.
BM can be implemented for certain variable-length encodings. In particular, it works fine with UTF-8 because misaligned false positives are impossible. With a larger class of variable-length encodings, you might still be able to implement a variant of BM which allows forward skips.
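To illustrate the skip-forward/compare-backward combination, here is a compact Boyer-Moore-Horspool variant (a simplified relative of full BM; a sketch, not a production matcher):

# Horspool search: compare the pattern right to left at each alignment,
# then skip forward based on the text character under the pattern's end.
def horspool_search(text, pat):
    m, n = len(pat), len(text)
    if m == 0 or m > n:
        return 0 if m == 0 else -1
    skip = {c: m - 1 - i for i, c in enumerate(pat[:-1])}  # bad-char shifts
    i = m - 1
    while i < n:
        j, k = m - 1, i
        while j >= 0 and text[k] == pat[j]:  # backward comparison
            j -= 1
            k -= 1
        if j < 0:
            return k + 1                     # match starts at k + 1
        i += skip.get(text[i], m)            # variable forward skip
    return -1

print(horspool_search("the cat sat on the mat", "sat"))  # 8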
There are a number of algorithms which require the ability to reset the string pointer to a previously encountered point; one example is word-wrapping an input to a specific line length. Those won't be impeded by your encoding provided your API allows for saving a copy of an iterator.
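For instance, a greedy word-wrapper that keeps resetting its read position to the last breakable point, which is exactly the saved-iterator requirement (Python sketch; assumptions mine):

# Greedy wrap: find the last space within the line width (a right-to-left
# scan of the window) and reset the read position to just past it.
def wrap(text, width):
    lines, start = [], 0
    while len(text) - start > width:
        brk = text.rfind(' ', start, start + width + 1)
        if brk <= start:                  # no space in the window: hard break
            lines.append(text[start:start + width])
            start += width
        else:
            lines.append(text[start:brk])
            start = brk + 1               # reset just past the saved break
    lines.append(text[start:])
    return lines

print(wrap("the quick brown fox jumps over the lazy dog", 10))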

Algorithm to un-concatenate words from string without spaces and punctuation

I've been given this problem in my data structures class; it's similar to an interview question. Could someone explain the thinking process or a solution? Pseudocode can be used. So far I've been thinking of using a trie to hold the dictionary and looking up words that way for efficiency.
This is the problem:
Oh, no! You have just completed a lengthy document when you have an unfortunate Find/Replace mishap. You have accidentally removed all spaces, punctuation, and capitalization in the document. A sentence like "I reset the computer. It still didn't boot!" would become "iresetthecomputeritstilldidntboot". You figure that you can add back in the punctuation and capitalization later, once you get the individual words properly separated. Most of the words will be in a dictionary, but some strings, like proper names, will not.
Given a dictionary (a list of words), design an algorithm to find the optimal way of "unconcatenating" a sequence of words. In this case, "optimal" is defined to be the parsing which minimizes the number of unrecognized sequences of characters.
For example, the string "jesslookedjustliketimherbrother" would be optimally parsed as "JESS looked just like TIM her brother". This parsing has seven unrecognized characters, which we have capitalized for clarity.
For each index, n, into the string, compute the cost C(n) of the optimal solution (i.e. the number of unrecognised characters in the optimal parsing) starting at that index.
Then, the solution to your problem is C(0).
There's a recurrence relation for C. At each n, either you match a word of i characters, or you skip over character n, incurring a cost of 1, and then parse the rest optimally. You just need to find which of those choices incurs the lowest cost.
Let N be the length of the string, and let W(n) be a set containing the lengths of all words starting at index n in your string. Then:
C(N) = 0
C(n) = min({C(n+1) + 1} union {C(n+i) for i in W(n)})
This can be implemented using dynamic programming by constructing a table of C(n) starting from the end backwards.
If the length of the longest word in your dictionary is L, then the algorithm runs in O(NL) time in the worst case and can be implemented to use O(L) memory if you're careful.
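A direct implementation of that recurrence (Python sketch; a plain set stands in for the dictionary, so each lookup costs a slice rather than the constant-time step a trie would give, and it keeps the full O(N) table rather than the O(L) window):

# Bottom-up DP for C(n): the minimum number of unrecognized characters
# in an optimal parsing of s[n:].
def min_unrecognized(s, dictionary):
    words = set(dictionary)
    longest = max(map(len, words), default=0)
    N = len(s)
    C = [0] * (N + 1)                  # C(N) = 0
    for n in range(N - 1, -1, -1):
        best = C[n + 1] + 1            # skip character n, cost 1
        for i in range(1, min(longest, N - n) + 1):
            if s[n:n + i] in words:    # a dictionary word of length i at n
                best = min(best, C[n + i])
        C[n] = best
    return C[0]

print(min_unrecognized("jesslookedjustliketimherbrother",
                       ["looked", "just", "like", "her", "brother"]))  # 7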
You could use rolling hashes of different lengths to speed up the search.
You can try a multi-pattern matcher, for example the Aho-Corasick algorithm. Basically it's a dictionary trie augmented with failure links -- a special, space-optimized relative of a suffix tree.

String data structure supporting append, prepend and search operations

I need to build a text editor as my mini project, and I need to design a data structure or algorithm that supports following operation:
Append : Append a character at the end of the String.
Prepend : Prepend a character at the beginning of the string.
Search : Given a search string s, find all the occurrences of the string.
Each operation should take O(log n) time or less. Search-and-replace operations would be appreciated but are not necessary. The maximum length of the string is constant. Any ideas how to achieve this?
Thanks!
A common data structure for this kind of application is a Rope, where Append and Prepend are O(1), although that depends a bit on whether the tree is balanced. However, as noted by Толя, Search would be linear.
There are certainly data structures that can make the search faster, such as a Suffix Tree, but they are probably not appropriate for a text editor application.
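For reference, a toy unbalanced rope (Python) showing why Append and Prepend are O(1) node creations; real ropes rebalance so that indexing and splitting stay logarithmic:

# Each rope node is either a leaf holding a short string or an internal
# node concatenating its two children.
class Rope:
    def __init__(self, s="", left=None, right=None):
        self.s, self.left, self.right = s, left, right

def append(rope, ch):                 # O(1): one new internal node
    return Rope(left=rope, right=Rope(ch))

def prepend(rope, ch):                # O(1): one new internal node
    return Rope(left=Rope(ch), right=rope)

def to_string(rope):                  # O(n): flatten the whole tree
    if rope is None:
        return ""
    if rope.left is None and rope.right is None:
        return rope.s
    return to_string(rope.left) + to_string(rope.right)

r = Rope("ell")
r = append(prepend(r, "h"), "o")
print(to_string(r))                   # hello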
I would propose you adapt a Trie. On an append operation, add all the suffixes of the string that end at the new character, with length up to the maximum string length in the data structure. On prepend, add all the prefixes of the string that start at the new char, with length up to the fixed maximum length. Asymptotically both operations are constant: they take O(k^2), where k is the fixed maximum length. For each node in the structure, keep track of all the strings ending at that node (possibly as a list).
A search operation will again be constant: iterate over the search string and output all indexes stored in the ending node (if you have not "dropped out of the tree").
A drawback of my approach is the memory overhead (a factor of up to the maximum string length), but if the maximum string length allowed is reasonable and you only insert real words (from an English dictionary, for instance), this should not be a big problem.

Symbolic representation of patterns in strings, and finding "similar" sub-patterns

A string "abab" could be thought of as a pattern of indexed symbols "0101". And a string "bcbc" would also be represented by "0101". That's pretty nifty and makes for powerful comparisons, but it quickly falls apart out of perfect cases.
"babcbc" would be "010202". If I wanted to note that it contains a pattern equal to "0101" (the bcbc part), I can only think of doing some sort of normalization process at each index to "re-represent" the substring from n to length symbolically for comparison. And that gets complicated if I'm trying to see if "babcbc" and "dababd" (010202 vs 012120) have anything in common. So inefficient!
How could this be done efficiently, taking care of all possible nested cases? Note that I'm looking for similar patterns, not similar sub-strings in the actual text.
Try replacing each character with min(K, distance back to previous occurrence of that character), where K is a tunable constant so babcbc and dababd become something like KK2K22 and KKK225. You could use a suffix tree or suffix array to find repeats in the transformed text.
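A sketch of that transform in Python (printing 'K' for "K or more", as in the examples above):

# Replace each character by min(K, distance back to its previous
# occurrence); characters never seen before count as distance K.
def transform(s, K):
    last, out = {}, []
    for i, c in enumerate(s):
        d = i - last[c] if c in last else K
        out.append('K' if d >= K else str(d))
        last[c] = i
    return ''.join(out)

print(transform("babcbc", 6), transform("dababd", 6))  # KK2K22 KKK225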
Your algorithm loses information by compressing the string's original data set, so I'm not sure you can recover the full information set without doing far more work than comparing the original strings. Also, while your representation is easier for human readability, it currently takes up as much space as the original string, and a difference map of the string (where the values are the distance between the prior character and the current character) may carry a more comparable information set.
However, as to how you can detect all common subsets, you should look at Longest Common Subsequence algorithms to find the largest matching pattern. It is a well-defined algorithm and is efficient -- O(n * m), where n and m are the lengths of the strings. See LCS on SO and Wikipedia. If you also want to see patterns which wrap around a string (as a circular string -- where abeab and eabab should match), then you'll need a circular LCS, which is described in a paper by Andy Nguyen.
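For reference, the standard O(n * m) LCS length table that the refinement below builds on (Python sketch):

# Longest Common Subsequence: T[i][j] is the LCS length of a[:i] and b[:j].
def lcs_length(a, b):
    n, m = len(a), len(b)
    T = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                T[i][j] = T[i - 1][j - 1] + 1
            else:
                T[i][j] = max(T[i - 1][j], T[i][j - 1])
    return T[n][m]

print(lcs_length("010202", "012120"))  # 4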
You'll need to change the algorithm slightly to account for the number of variations so far. My advice would be to add two additional dimensions to the LCS table, representing the number of unique numbers encountered in the past k characters of both original strings, along with your compressed version of each string. Then you could do an LCS solve where you always move in the direction which matches on your compressed strings AND matches the same number of unique characters in both strings for the past k characters. This should encode all possible unique substring matches.
The tricky part will be always choosing the direction which maximizes the k which contains the same number of unique characters. Thus at each element of the LCS table you'll have an additional search for the best k value. Since a longer sequence always contains all possible smaller sequences, if you maximize your choice of k during each step, you know that the best k on the next iteration is at most 1 step away, so once the 4D table is filled out it should be solvable in a similar fashion to the original LCS table. Note that because you have a 4D table the logic does get more complicated, but if you read up on how LCS works you'll see how you can define consistent rules for moving toward the upper-left corner at each step. Thus the LCS algorithm stays the same, just scaled to more dimensions.
This solution is quite complicated once it's complete, so you may want to rethink what you're trying to achieve, and whether this pattern encodes the information you actually want, before you start writing such an algorithm.
Here is a solution that uses Prolog's unification capabilities and attributed variables to match templates:
:- dynamic pattern_i/3.

test :-
    retractall(pattern_i(_, _, _)),
    add_pattern(abab),
    add_pattern(bcbc),
    add_pattern(babcbc),
    add_pattern(dababd),
    show_similarities.

show_similarities :-
    call(pattern_i(Word, Pattern, Maps)),
    match_pattern(Word, Pattern, Maps),
    fail.
show_similarities.

match_pattern(Word, Pattern, Maps) :-
    all_dif(Maps),                      % all variables should be unique
    call(pattern_i(MWord, MPattern, MMaps)),
    Word \= MWord,
    all_dif(MMaps),
    append([_, Pattern, _], MPattern),  % matches patterns
    writeln(words(Word, MWord)),
    write('mapping: '),
    match_pattern1(Maps, MMaps).        % prints mappings

match_pattern1([], _) :-
    nl, nl.
match_pattern1([Char-Char|Maps], MMaps) :-
    select(MChar-Char, MMaps, NMMaps),
    write(Char), write('='), write(MChar), write(' '),
    !,
    match_pattern1(Maps, NMMaps).

add_pattern(Word) :-
    word_to_pattern(Word, Pattern, Maps),
    assertz(pattern_i(Word, Pattern, Maps)).

word_to_pattern(Word, Pattern, Maps) :-
    atom_chars(Word, Chars),
    chars_to_pattern(Chars, [], Pattern, Maps).

chars_to_pattern([], Maps, [], RMaps) :-
    reverse(Maps, RMaps).
chars_to_pattern([Char|Tail], Maps, [PChar|Pattern], NMaps) :-
    member(Char-PChar, Maps),
    !,
    chars_to_pattern(Tail, Maps, Pattern, NMaps).
chars_to_pattern([Char|Tail], Maps, [PChar|Pattern], NMaps) :-
    chars_to_pattern(Tail, [Char-PChar|Maps], Pattern, NMaps).

all_dif([]).
all_dif([_-Var|Maps]) :-
    all_dif(Var, Maps),
    all_dif(Maps).

all_dif(_, []).
all_dif(Var, [_-MVar|Maps]) :-
    dif(Var, MVar),
    all_dif(Var, Maps).
The idea of the algorithm is:
For each word, generate a list of unbound variables, using the same variable for the same char in the word. E.g. for the word abcbc the list would look like [X,Y,Z,Y,Z]. This defines the template for this word.
Once we have the list of templates, we take each one and try to unify it with a subtemplate of every other word. So for example, if we have the words abcbc and zxzx, the templates would be [X,Y,Z,Y,Z] and [H,G,H,G]. Then there is a subtemplate of the first template which unifies with the template of the second word (H=Y, G=Z).
For each template match we show the substitutions needed (variable renamings) to yield that match. So in our example the substitutions would be z=b, x=c.
Output for test (words abab, bcbc, babcbc, dababd):
?- test.
words(abab,bcbc)
mapping: a=b b=c
words(abab,babcbc)
mapping: a=b b=c
words(abab,dababd)
mapping: a=a b=b
words(bcbc,abab)
mapping: b=a c=b
words(bcbc,babcbc)
mapping: b=b c=c
words(bcbc,dababd)
mapping: b=a c=b

Removing repeated characters in string without using recursion

You are given a string. Develop a function to remove duplicate characters from that string. The string could be of any length. Your algorithm must be in place. If you wish, you can use constant-size extra space which does not depend in any way on the string size. Your algorithm must be of complexity O(n).
My idea was to define an integer array of size 26, where the 0th index corresponds to the letter a and the 25th index to the letter z, and initialize all elements to 0.
Thus we traverse the entire string once and increment the value at the corresponding index whenever we encounter a letter.
Then we traverse the string once again, and if the value at the corresponding index is 1 we print out the letter, otherwise we do not.
This way the time complexity is O(n) and the space used is constant irrespective of the length of the string!
If anyone can come up with more efficient ideas, it would be very helpful!
Your solution definitely fits the criteria of O(n) time. Instead of an array, which would be very, very large if the allowed alphabet is large (Unicode has over a million characters), you could use a plain hash. Here is your algorithm in (unoptimized!) Ruby:
def undup(s)
  seen = Hash.new(0)
  s.each_char { |c| seen[c] += 1 }
  result = ""
  s.each_char { |c| result << c if seen[c] == 1 }
  result
end
puts(undup "")
puts(undup "abc")
puts(undup "Olé")
puts(undup "asdasjhdfasjhdfasbfdasdfaghsfdahgsdfahgsdfhgt")
It makes two passes through the string, and since hash lookup is amortized constant time, you're good.
You can say the Hashtable (like your array) uses constant space, albeit large, because it is bounded above by the size of the alphabet. Even if the size of the alphabet is larger than that of the string, it still counts as constant space.
There are many variations to this problem, many of which are fun. To do it truly in place, you can sort first; this gives O(n log n). There are variations on merge sort where you ignore dups during the merge. In fact, this "no external hashtable" restriction appears in Algorithm: efficient way to remove duplicate integers from an array (also tagged interview question).
Another common interview question starts with a simple string, then they say, okay now a million character string, okay now a string with 100 billion characters, and so on. Things get very interesting when you start considering Big Data.
Anyway, your idea is pretty good. It can generally be tweaked as follows: use a set, not a dictionary. Go through the string. For each character, if it is not in the set, add it. If it is, delete it. Sets take up less space, don't need counters, and can be implemented as bitsets if the alphabet is small, and this algorithm does not need two passes.
Python implementation: http://code.activestate.com/recipes/52560-remove-duplicates-from-a-sequence/
You can also use a bitset instead of the additional array to keep track of found chars. Depending on which characters (a-z or more) are allowed you size the bitset accordingly. This requires less space than an integer array.
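Here is a sketch of that bitset idea for a lowercase a-z alphabet, using Python integers as bit masks (one for "seen at least once", one for "seen more than once"):

# Constant space: two 26-bit masks replace the count array entirely.
def undup_az(s):
    seen = dup = 0
    for c in s:
        bit = 1 << (ord(c) - ord('a'))
        dup |= seen & bit          # a second sighting marks it duplicated
        seen |= bit
    return ''.join(c for c in s if not dup & (1 << (ord(c) - ord('a'))))

print(undup_az("abcabd"))  # cd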
