Palindrome Validity Proof of Correctness - string

Leetcode Description
Given a string of length n, you have to decide whether the string is a palindrome or not, but you may delete at most one character.
For example: "aba" is a palindrome.
"abca" is also valid, since we may remove either "b" or "c" to get a palindrome.
I have seen many solutions that take the following approach.
With two pointers left and right initialized to the first and last characters of the string, respectively, keep incrementing left and decrementing right in step as long as the characters they point to are equal.
The first time we run into a mismatch between the characters pointed to by left and right, say at indices i and j, we simply check whether string[i..j-1] or string[i+1..j] is a palindrome.
I clearly see why this works, but one thing that's bothering me is the direction of the approach that we take when we first see the mismatch.
Assuming we do not care about time efficiency and only focus on correctness, I cannot see what prevents us from deleting a character in string[0..i-1] or string[j+1..n-1] instead and checking whether the entire resulting string is a palindrome.
More specifically, if we take the first approach and see that neither string[i..j-1] nor string[i+1..j] is a palindrome, what prevents us from falling back to the second approach I described and seeing whether deleting a character from string[0..i-1] or string[j+1..n-1] yields a palindrome instead?
Can we mathematically prove why this approach is useless or simply incorrect?
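Short of a proof, the question can at least be checked empirically. The sketch below is an illustration, not taken from any particular posted solution: it implements the two-pointer approach described above and compares it against an exhaustive search that tries every possible single deletion, including deletions inside the already-matched prefix and suffix. The function names and the test alphabet are assumptions made for this example.

```python
from itertools import product


def is_palindrome(s, lo, hi):
    """Return True if s[lo..hi] (inclusive) reads the same both ways."""
    while lo < hi:
        if s[lo] != s[hi]:
            return False
        lo += 1
        hi -= 1
    return True


def valid_two_pointer(s):
    """Two-pointer check: palindrome after deleting at most one character."""
    i, j = 0, len(s) - 1
    while i < j and s[i] == s[j]:
        i += 1
        j -= 1
    if i >= j:
        return True
    # First mismatch at (i, j): only deleting s[i] or s[j] is tried.
    return is_palindrome(s, i + 1, j) or is_palindrome(s, i, j - 1)


def valid_brute_force(s):
    """Try no deletion and every possible single deletion, anywhere."""
    candidates = [s] + [s[:k] + s[k + 1:] for k in range(len(s))]
    return any(c == c[::-1] for c in candidates)


def agree_on_all_strings(max_len, alphabet="abc"):
    """Compare both checks on every string up to max_len over alphabet."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if valid_two_pointer(s) != valid_brute_force(s):
                return False
    return True
```

If agree_on_all_strings(7) returns True, deleting a character outside string[i..j] never rescued a string on that test set. That is evidence rather than a proof.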

Related

Does this strict weak ordering have a name (spoilers for a specific coding puzzle)

There is a coding puzzle I have encountered on one of those sites (I don't recall if it was leetcode or something else) which goes as follows: Given a list of strings, return the lexicographically smallest concatenation that uses each of the strings once. The solution is, on a technical level, fairly simple. You compare 2 strings a and b by checking whether ab<ba (lexicographically), sort the list, concatenate everything.
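The solution described above can be sketched in a few lines. The function names are made up for this example, and the brute-force variant is only there to cross-check small inputs.

```python
from functools import cmp_to_key
from itertools import permutations


def smallest_concatenation(strings):
    """Sort with the 'ab < ba' comparison, then concatenate."""
    def compare(a, b):
        if a + b < b + a:
            return -1
        if a + b > b + a:
            return 1
        return 0

    return "".join(sorted(strings, key=cmp_to_key(compare)))


def smallest_concatenation_brute_force(strings):
    """Try every ordering; only practical for very short lists."""
    return min("".join(p) for p in permutations(strings))
```

For example, smallest_concatenation(["b", "ba"]) puts "ba" first, because "bab" < "bba".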
Now for the actual question: Does this ordering have a name? I tried googling around but never found anything.
There is also a secondary aspect to this, which is: Is it somehow immediately obvious that this is even a strict weak ordering? I certainly didn't think it was. Here is the proof that I came up with to convince myself that it is one:
For any given string s let |s| be its length and let s^n be s repeated n times.
If ab<ba then a^|b|b^|a|<b^|a|a^|b| (to see this just successively swap neighboring ab pairs to get a lexicographically increasing sequence that ends in b^|a|a^|b|). It follows that a^|b|<b^|a| because they have the same length. The same argument works for > and = so we have proven that ab<ba is actually equivalent to a^|b|<b^|a|, with the latter clearly defining a strict weak ordering.
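The claimed equivalence between comparing ab with ba and comparing a^|b| with b^|a| can also be checked mechanically on small inputs. This is only a sanity check under an assumed two-letter alphabet, not a replacement for the proof; the function names are hypothetical.

```python
from itertools import product


def sign(x, y):
    """Return -1, 0 or 1 according to how x compares with y."""
    return (x > y) - (x < y)


def orders_agree(max_len=4, alphabet="ab"):
    """Check sign(ab vs ba) == sign(a^|b| vs b^|a|) for all short a, b."""
    words = [
        "".join(chars)
        for n in range(1, max_len + 1)
        for chars in product(alphabet, repeat=n)
    ]
    for a in words:
        for b in words:
            by_concatenation = sign(a + b, b + a)
            by_powers = sign(a * len(b), b * len(a))
            if by_concatenation != by_powers:
                return False
    return True
```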

Algorithm (or pointer to literature) sought for string processing challenge

A group of amusing students write essays exclusively by plagiarising portions of the complete works of William Shakespeare. At one end of the scale, an essay might consist exclusively of a verbatim copy of a soliloquy... at the other, one might see work so novel that - while using a common alphabet - no two adjacent characters in the essay were used adjacently by Will.
Essays need to be graded. A score of 1 is assigned to any essay which can be found (character-by-character identical) in the plain-text of the complete works. A score of 2 is assigned to any work that can be successfully constructed from no fewer than two distinct (character-by-character identical) passages in the complete works, and so on... up to the limit - for an essay with N characters - which scores N if, and only if, no two adjacent characters in the essay were also placed adjacently in the complete works.
The challenge is to implement a program which can efficiently (and accurately) score essays. While any (practicable) data-structure to represent the complete works is acceptable - the essays are presented as ASCII strings.
Having considered this teasing question for a while, I came to the conclusion that it is much harder than it sounds. The naive solution, for an essay of length N, involves 2**(N-1) traversals of the complete works - which is far too inefficient to be practical.
While, obviously, I'm interested in suggested solutions - I'd also appreciate pointers to any literature that deals with this, or any similar, problem.
CLARIFICATIONS
Perhaps some examples (ranging over much shorter strings) will help clarify the 'score' for 'essays'?
Assume Shakespeare's complete works are abridged to:
"The quick brown fox jumps over the lazy dog."
Essays scoring 1 include "own fox jump" and "The quick brow". The essay "jogging" scores 6 (despite being short) because it can't be represented in fewer than 6 segments of the complete works... It can be segmented into six strings that are all substrings of the complete works as follows: "[j][og][g][i][n][g]". N.B. Establishing scores for this short example is trivial compared to the original problem - because, in this example "complete works" - there is very little repetition.
Hopefully, this example segmentation helps clarify the 2**(N-1) traversals of the complete works mentioned earlier. If we consider the segmentation, each of the (N-1) gaps between the N characters in the essay may either be a gap between segments, or not... resulting in ~ 2**(N-1) segmentation hypotheses, each of which has to be tested with substring searches of the complete works.
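To make the naive approach concrete, here is a minimal sketch that enumerates every cut/no-cut choice for the N-1 gaps and keeps the smallest segmentation whose pieces all occur in the complete works. The function name is made up for this example, and Python's built-in substring test stands in for whatever index would really be used.

```python
from itertools import product


def score_brute_force(essay, works):
    """Try all 2**(N-1) segmentations; return the fewest pieces, or None."""
    if not essay:
        return 0
    best = None
    for cuts in product([False, True], repeat=len(essay) - 1):
        segments = []
        start = 0
        for gap, cut in enumerate(cuts):
            if cut:
                segments.append(essay[start:gap + 1])
                start = gap + 1
        segments.append(essay[start:])
        # A segmentation is valid only if every piece is a substring.
        if all(segment in works for segment in segments):
            if best is None or len(segments) < best:
                best = len(segments)
    return best
```

With the abridged works from the example, score_brute_force("jogging", works) evaluates 64 segmentations, which shows why this does not scale to real essays.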
An (N)DFA would be a wonderful solution - if it were practical. I can see how to construct something that solved 'substring matching' in this way - but not scoring. The state space for scoring, on the surface at least, seems wildly too large (for any substantial complete works of Shakespeare). I'd welcome any explanation that undermines my assumption that the (N)DFA would be too large to be practical to compute/store.
A general approach for plagiarism detection is to append the student's text to the source text separated by a character not occurring in either and then to build either a suffix tree or suffix array. This will allow you to find in linear time large substrings of the student's text which also appear in the source text.
I find it difficult to be more specific because I do not understand your explanation of the score - the method above would be good for finding the longest stretch in the student's work which is an exact quote, but I don't understand your N - is it the number of distinct sections of source text needed to construct the student's text?
If so, there may be a dynamic programming approach. At step k, we work out the least number of distinct sections of source text needed to construct the first k characters of the student's text. Using a suffix array built just from the source text or otherwise, we find the longest match between the source text and characters x..k of the student's text, where x is of course as small as possible. Then the least number of sections of source text needed to construct the first k characters of student text is the least needed to construct 1..x-1 (which we have already worked out) plus 1. By running this process for k=1..the length of the student text we find the least number of sections of source text needed to reconstruct the whole of it.
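A minimal sketch of this dynamic programme follows. It uses 0-based slices instead of the 1-based positions in the text, and a plain substring test where the answer suggests a suffix array, so it illustrates the recurrence only and makes no attempt at efficiency. The function name is an assumption.

```python
def score_dp(essay, works):
    """dp[k] = fewest source sections needed for the first k characters."""
    n = len(essay)
    dp = [0] * (n + 1)
    for k in range(1, n + 1):
        x = k - 1
        if essay[x:k] not in works:
            raise ValueError(f"character {essay[x]!r} never occurs in the works")
        # Push the start of the last section as far left as it will go.
        while x > 0 and essay[x - 1:k] in works:
            x -= 1
        # Everything before the last section was already solved.
        dp[k] = dp[x] + 1
    return dp[n]
```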
(Or you could just search StackOverflow for the student's text, on the grounds that students never do anything these days except post their question on StackOverflow :-)).
I claim that repeatedly moving along the target string from left to right, using a suffix array or tree to find the longest match at any time, will find the smallest number of different strings from the source text that produces the target string. I originally found this by looking for a dynamic programming recursion but, as pointed out by Evgeny Kluev, this is actually a greedy algorithm, so let's try and prove this with a typical greedy algorithm proof.
Suppose not. Then there is a solution better than the one you get by going for the longest match every time you run off the end of the current match. Compare the two proposed solutions from left to right and look for the first time when the non-greedy solution differs from the greedy solution. If there are multiple non-greedy solutions that do better than the greedy solution I am going to demand that we consider the one that differs from the greedy solution at the last possible instant.
If the non-greedy solution is going to do better than the greedy solution, and there isn't a non-greedy solution that does better and differs later, then the non-greedy solution must find that, in return for breaking off its first match earlier than the greedy solution, it can carry on its next match for longer than the greedy solution. If it can't, it might somehow do better than the greedy solution, but not in this section, which means there is a better non-greedy solution which sticks with the greedy solution until the end of our non-greedy solution's second matching section, which is against our requirement that we want the non-greedy better solution that sticks with the greedy one as long as possible. So we have to assume that, in return for breaking off the first match early, the non-greedy solution gets to carry on its second match longer. But this doesn't work, because, when the greedy solution finally has to finish using its first match, it can jump on to the same section of matching text that the non-greedy solution is using, just entering that section later than the non-greedy solution did, but carrying on for at least as long as the non-greedy solution. So there is no non-greedy solution that does better than the greedy solution and the greedy solution is optimal.
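The greedy procedure argued for above can be sketched as follows. A plain substring test stands in for the suffix array or tree, so this version is quadratic or worse; the function name is hypothetical.

```python
def score_greedy(essay, works):
    """Repeatedly take the longest match starting at the current position."""
    pieces = 0
    i = 0
    while i < len(essay):
        length = 0
        # Extend the current match for as long as it is still a substring.
        while i + length < len(essay) and essay[i:i + length + 1] in works:
            length += 1
        if length == 0:
            raise ValueError(f"character {essay[i]!r} never occurs in the works")
        i += length
        pieces += 1
    return pieces
```

On the abridged example, score_greedy("jogging", works) splits the essay as [j][og][g][i][n][g], matching the score of 6 given in the question.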
Have you considered using N-Grams to solve this problem?
http://en.wikipedia.org/wiki/N-gram
First read the complete works of Shakespeare and build a trie. Then process the string left to right. We can greedily take the longest substring that matches one in the data because we want the minimum number of strings, so there is no factor of 2^N. The second part is dirt cheap O(N).
The depth of the trie is limited by the available space. With a gigabyte of ram you could reasonably expect to exhaustively cover Shakespearean English string of length at least 5 or 6. I would require that the leaf nodes are unique (which also gives a rule for constructing the trie) and keep a pointer to their place in the actual works, so you have access to the continuation.
This feels like a problem of partial matching a very large regular expression.
If so, it can be solved by a very large nondeterministic finite-state automaton, or, put more broadly, by a graph that represents, for every character in the works of Shakespeare, all the possible next characters.
If necessary for efficiency reasons the NDFA is guaranteed to be convertible to a DFA. But then this construction can give rise to 2^n states, maybe this is what you were alluding to?
This aspect of the complexity does not really worry me. The NDFA will have M + C states: one state for each of the M characters in the works, and C start states, where C = 26*2 + #punctuation, connected to the M character states so that the algorithm can (re)start when there are 0 matched characters. The question is whether the corresponding DFA would have O(2^M) states and, if so, whether it is necessary to build that DFA; theoretically it is not. Consider that, in this construction, each character state has exactly one transition, to the state for the next character in the works. We would expect each of the start states to be connected to M/C states on average, but to as many as M in the worst case, meaning the NDFA simulation has to track at most M simultaneous states. That's a large number, but not an impossibly large one for computers these days.
The score would be derived by initializing it to 1 and then incrementing it every time a non-accepting state is reached.
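One possible reading of this construction is to simulate the automaton directly, tracking the set of positions in the works that the current match could be sitting on. The sketch below is an interpretation of the description above rather than the answerer's own code, and the function name is made up.

```python
def score_by_simulation(essay, works):
    """Track every position in works where the current match could end."""
    if not essay:
        return 0

    def restart(ch):
        # Start states: every position in the works holding this character.
        positions = {p for p, c in enumerate(works) if c == ch}
        if not positions:
            raise ValueError(f"character {ch!r} never occurs in the works")
        return positions

    score = 1
    active = restart(essay[0])
    for ch in essay[1:]:
        # Each character state has one transition: to the next character.
        advanced = {
            p + 1 for p in active
            if p + 1 < len(works) and works[p + 1] == ch
        }
        if not advanced:
            # No live state left: count a new section and restart.
            score += 1
            advanced = restart(ch)
        active = advanced
    return score
```

The active set never holds more than len(works) positions, which is the bound of M simultaneous states mentioned above.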
It's true that one of the approaches to string searching is building a DFA. In fact, for the majority of the string search algorithms, it looks like a small modification on failure to match (increment counter) and success (keep going) can serve as a general strategy.

How to find the period of a string

I take an input from the user, and it's a string with a certain substring that repeats itself throughout the string. I need to output the substring or its length, AKA the period.
Say
S1 = AAAA // substring is A
S2 = ABAB // Substring is AB
S3 = ABCAB // Substring is ABC
S4 = EFIEFI // Substring is EFI
I could start with a single character and check whether it is the same as the next character; if it is not, I could try two characters, then three, and so on. This would be an O(N^2) algorithm. I was wondering if there is a more elegant solution to this.
You can do this in linear time and constant additional space by inductively computing the period of each prefix of the string. I can't recall the details (there are several things to get right), but you can find them in Section 13.6 of "Text algorithms" by Crochemore and Rytter under function Per(x).
Let me assume that the length n of the string is at least twice the period p.
Algorithm
1. Let m = 1, and let S be the whole string.
2. Take m = m*2.
3. Find the next occurrence of the substring S[:m].
4. Let k be the start of the next occurrence.
5. Check if S[:k] is the period.
6. If not, go to 2.
Example
Suppose we have a string
CDCDFBFCDCDFDFCDCDFBFCDCDFDFCDC
For each power of 2, 2^m, we find repetitions of the first 2^m characters. Then we extend this sequence to its second occurrence. Let's start with 2^1, so CD.
CDCDFBFCDCDFDFCDCDFBFCDCDFDFCDC
CDCD   CDCD   CDCD   CDCD   CD
We don't extend CD since the next occurrence is just after that. However CD is not the substring we are looking for so let's take the next power: 2^2 = 4 and substring CDCD.
CDCDFBFCDCDFDFCDCDFBFCDCDFDFCDC
CDCD   CDCD
Now let's extend our string to the first repetition. We get
CDCDFBF
we check if this is periodic. It is not so we go further. We try 2^3 = 8, so CDCDFBFC
CDCDFBFCDCDFDFCDCDFBFCDCDFDFCDC
CDCDFBFC      CDCDFBFC
we try to extend and we get
CDCDFBFCDCDFDF
and this indeed is our period.
I expect this to work in O(n log n) with some KMP-like algorithm for checking where a given string appears. Note that some edge cases still should be worked out here.
Intuitively this should work, but my intuition failed once on this problem already so please correct me if I'm wrong. I will try to figure out a proof.
A very nice problem though.
You can build a suffix tree for the entire string in linear time (suffix tree is easy to look up online), and then recursively compute and store the number of suffix tree leaves (occurrences of the suffix prefix) N(v) below each internal node v of the suffix tree. Also recursively compute and store the length of each suffix prefix L(v) at each node of the tree. Then, at an internal node v in the tree, the suffix prefix encoded at v is a repeating subsequence that generates your string if N(v) equals the total length of the string divided by L(v).
We can actually optimise the time complexity by creating a Z array. We can create the Z array in O(n) time and O(n) space. Now, let's say we have the string
S1 = abababab
For this the Z array would look like
z[]={8,0,6,0,4,0,2,0};
In order to calculate the period we can iterate over the Z array and look for the smallest i > 0 that satisfies the condition i+z[i]=S1.length. That i is the period.
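A minimal sketch of this idea, using the standard linear-time Z-array construction with the convention z[0] = n shown above. The function names are made up for this example, and falling back to the full length when no shorter period exists is a choice made here.

```python
def z_array(s):
    """z[i] = length of the longest common prefix of s and s[i:]."""
    n = len(s)
    z = [0] * n
    if n:
        z[0] = n
    left = right = 0  # [left, right) is the rightmost match window found
    for i in range(1, n):
        if i < right:
            # Reuse what is already known from inside the window.
            z[i] = min(right - i, z[i - left])
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


def period_from_z(s):
    """Smallest i > 0 with i + z[i] == len(s); len(s) if there is none."""
    z = z_array(s)
    for i in range(1, len(s)):
        if i + z[i] == len(s):
            return i
    return len(s)
```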
Well, if every character in the input string is part of the repeating substring, then all you have to do is store the first character and compare it with the rest of the string's characters one by one. If you find a match, the string up to the matched character is your repeating string.
I too have been looking for the time-space-optimal solution to this problem. The accepted answer by tmyklebu essentially seems to be it, but I would like to offer some explanation of what it's actually about and some further findings.
First, this question by me proposes a seemingly promising but incorrect solution, with notes on why it's incorrect: Is this algorithm correct for finding period of a string?
In general, the problem "find the period" is equivalent to "find the pattern within itself" (in some sense, "strstr(x+1,x)"), but with no constraints on matching past its end. This means that you can find the period by taking any left-to-right string matching algorithm and applying it to the pattern itself, considering a partial match that hits the end of the haystack/text as a match, and the time and space requirements are the same as those of whatever string matching algorithm you use.
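The "pattern within itself" formulation can be illustrated with the most naive matcher possible: slide the string along itself and accept the first shift where the overlapping part agrees all the way to the end. This is quadratic and only meant to show the principle; the function name is hypothetical.

```python
def period_naive(s):
    """Smallest shift p >= 1 such that s matches itself shifted by p."""
    n = len(s)
    for p in range(1, n + 1):
        # A partial match that runs off the end of s counts as a match.
        if s[p:] == s[:n - p]:
            return p
    return 0  # only reached for the empty string
```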
The approach cited in tmyklebu's answer is essentially applying this principle to String Matching on Ordered Alphabets, also explained here. Another time-space-optimal solution should be possible using the GS algorithm.
The fairly well-known and simple Two Way algorithm (also explained here) unfortunately is not a solution because it's not left-to-right. In particular, the advancement after a mismatch in the left factor depends on the right factor having been a match, and the impossibility of another match misaligned with the right factor modulo the right factor's period. When searching for the pattern within itself and disregarding anything past the end, we can't conclude anything about how soon the next right-factor match could occur (part or all of the right factor may have shifted past the end of the pattern), and therefore a shift that preserves linear time cannot be made.
Of course, if working space is available, a number of other algorithms may be used. KMP is linear-time with O(n) space, and it may be possible to adapt it to something still reasonably efficient with only logarithmic space.
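For the KMP option, a minimal sketch: the period is the string length minus the last value of the KMP failure (prefix) function. This is the linear-time, O(n)-space variant, not the logarithmic-space adaptation speculated about above; the function names are made up.

```python
def prefix_function(s):
    """fail[i] = length of the longest proper border of s[:i + 1]."""
    fail = [0] * len(s)
    k = 0
    for i in range(1, len(s)):
        while k > 0 and s[i] != s[k]:
            k = fail[k - 1]
        if s[i] == s[k]:
            k += 1
        fail[i] = k
    return fail


def period_kmp(s):
    """Length of s minus its longest proper border; 0 for the empty string."""
    if not s:
        return 0
    return len(s) - prefix_function(s)[-1]
```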

NSString compare efficiency

String compares can be costly. There's some statistic floating around that says a very high percent of string compares can be eliminated by first comparing string sizes. So I'm curious to know whether the NSString compare: method takes this into consideration. Anyone know?
According to the sources here (which is just one implementation, others may act differently), compare doesn't check the length first, which actually makes sense since it's not an equality check. As it returns a less-than/equal-to/greater-than return code, it has to check the characters, even if the lengths are the same.
A pure isEqual-type method may be able to shortcut character checks if the lengths are different, but compare does not have that luxury.
It does do certain checks of the length against zero, but not comparisons of the two lengths against each other.
Yes it does. It also checks for pointer equality before that (which covers the constant string case and some others due to string uniquing and the string ROM).
(edit) This answer applies to -isEqualToString:, not -compare:. I misread
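The difference between the two kinds of check can be illustrated in a language-neutral way. The Python sketch below is not NSString's implementation and does not reflect its actual collation rules; it only shows why an equality test can bail out on identity or length, while a three-way comparison has to look at characters even when the lengths differ. Both function names are hypothetical.

```python
def is_equal(a, b):
    """Equality test: identity and length checks can short-circuit."""
    if a is b:
        return True
    if len(a) != len(b):
        return False
    return all(x == y for x, y in zip(a, b))


def compare(a, b):
    """Three-way comparison: returns -1, 0 or 1."""
    for x, y in zip(a, b):
        if x != y:
            return -1 if x < y else 1
    # All shared positions agree, so the shorter string sorts first.
    return (len(a) > len(b)) - (len(a) < len(b))
```

For instance, compare("b", "ab") is 1 while compare("a", "ab") is -1, so knowing only that the lengths differ does not decide the result.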

Longest palindromic substring and suffix trie

I was Googling about a rather well-known problem, namely: the longest palindromic substring
I have found links that recommend suffix tries as a good solution to the problem.
Example SO and Algos
The approach is (as I understand it) e.g. for a string S create Sr (which is S reversed) and then create a generalized suffix trie.
Then find the longest common substring of S and Sr, which is the path from the root to the deepest node that belongs to both S and Sr.
So the solution using the suffix tries approach essentially reduces to Find the longest common substring problem.
My question is the following:
If the input string is S = “abacdfgdcaba”, so Sr = “abacdgfdcaba”, the longest common substring is abacd, which is NOT a palindrome.
So my question is: Is the approach of using suffix tries erroneous? Am I misunderstanding/misreading something here?
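The counterexample is easy to reproduce. The sketch below uses a simple quadratic dynamic programme for the longest common substring instead of a generalized suffix trie, which is enough to show the effect; the function name is made up for this example.

```python
def longest_common_substring(a, b):
    """Quadratic DP; returns the first longest common substring found in a."""
    best_len = 0
    best_end = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run ending at a[i-2], b[j-2].
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len = cur[j]
                    best_end = i
        prev = cur
    return a[best_end - best_len:best_end]
```

Calling longest_common_substring(S, S[::-1]) on the string above yields a common substring of length 5 that is not a palindrome, which confirms that the plain reduction needs an extra check.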
Yes, finding the longest palindrome by using LCS-like algorithms is not a good way. I didn't read the referenced answer carefully, but this line in the answer is completely wrong:
So the longest contained palindrome within a string is exactly the longest common substring of this string and its reverse
But if you read it and you have a counterexample, don't worry about it (you are right in 99%); this is a common mistake. A simple way is as follows:
Write down the string (barbapapa) as follows: #b#a#r#b#a#p#a#p#a#. Now traverse each character of this new string from left to right, checking its left and right neighbours to see whether it is a palindrome center. This algorithm is O(n^2) in the worst case and is correct, but it will normally find the palindrome in O(n) (admittedly, proving this for the average case is hard). The worst case occurs for strings with many long palindromes, like aaaaaa...aaaa.
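The simple method just described can be sketched as follows. The function name is made up, and returning the first longest palindrome on ties is a choice made for this example.

```python
def longest_palindrome_by_centers(s):
    """Expand around every position of the #-separated string."""
    t = "#" + "#".join(s) + "#"
    best_center = 0
    best_radius = 0
    for center in range(len(t)):
        radius = 0
        # Grow while the characters to the left and right still match.
        while (center - radius - 1 >= 0
               and center + radius + 1 < len(t)
               and t[center - radius - 1] == t[center + radius + 1]):
            radius += 1
        if radius > best_radius:
            best_center, best_radius = center, radius
    # A radius in t equals the palindrome's length in s once # is removed.
    start = (best_center - best_radius) // 2
    return s[start:start + best_radius]
```

For barbapapa the best center is the "a" between the two "p" characters, with radius 5 in the #-separated string, i.e. the palindrome #a#p#a#p#a# of length 11.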
But there is a better approach which takes O(n) time; the basis of this algorithm is due to Manacher. That algorithm is more complicated than what I saw in your referenced answer. What I offered is the base idea of Manacher's algorithm: with clever changes to the algorithm you can skip checking all the lefts and rights (there are also algorithms using suffix trees).
P.S.: I couldn't see your Algo link because of my internet limitations, so I don't know whether it's correct or not.
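For reference, here is a compact sketch of Manacher's algorithm on the same #-separated string. It follows the commonly presented formulation, which reuses the radius of the mirrored position inside the rightmost palindrome found so far; the variable and function names are choices made for this example.

```python
def longest_palindrome_manacher(s):
    """Linear-time longest palindromic substring (first one on ties)."""
    t = "#" + "#".join(s) + "#"
    n = len(t)
    radius = [0] * n
    center = right = 0  # rightmost palindrome found so far spans up to right
    for i in range(n):
        if i < right:
            # Start from what the mirror position already guarantees.
            radius[i] = min(right - i, radius[2 * center - i])
        lo = i - radius[i] - 1
        hi = i + radius[i] + 1
        while lo >= 0 and hi < n and t[lo] == t[hi]:
            radius[i] += 1
            lo -= 1
            hi += 1
        if i + radius[i] > right:
            center, right = i, i + radius[i]
    best = max(range(n), key=lambda i: radius[i])
    start = (best - radius[best]) // 2
    return s[start:start + radius[best]]
```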
I added my discussion with OP to clarify the algorithm:
Let's test it with barbapapa -> #b#a#r#b#a#p#a#p#a#, starting from the first #.
There is no left neighbour, so it is the center of a palindrome of length 1.
Now "b" has # on its left and # on its right, but there is no further left character to match with the right,
so it is the center of a palindrome of length 3.
Let's skip the other parts and go to the first "p":
the first left and right are #, the second left and right are "a", the third left and
right are #, but the fourth left and right are not equal, so it is the center of a palindrome
of length 7: #a#p#a# is a palindrome but b#a#p#a#p is not.
Now look at the first "a" after the first "p": you have #a#p#a#p#a# as a palindrome, and this "a"
is the center of this palindrome, with length 11. If you calculate all the other palindromes,
the lengths of all of them are smaller than 11.
Also, the # is used so that palindromes of even length are considered.
After finding the center of the palindrome in the newly created string, find the related palindrome (by knowing the center and its length), then remove the # characters to get the longest palindrome.

Resources