Maximum repeating substring of size n

Maximum repeating substring of size n - string

Find the substring of length n that repeats a maximum number of times in a given string.
Input: abbbabbbb# 2
Output: bb
My solution:
public static String mrs(String s, int m) {
int n = s.length();
String[] suffixes = new String[n-m+1];
for (int i = 0; i < n-m+1; i++) {
suffixes[i] = s.substring(i, i+m);
}
Arrays.sort(suffixes);
String ans = "", tmp=suffixes[0].substring(0,m);
int cnt = 1, max=0;
for (int i = 0; i < n-m; i++) {
if (suffixes[i].equals(suffixes[i+1])){
cnt++;
}else{
if(cnt>max){
max = cnt;
ans =tmp;
}
cnt=0;
tmp = suffixes[i];
}
}
return ans;
}
Can it be done better than the above O(nm) time and O(n) space solution?

For a string of length L and a given length k (not to mess up with n and m which the question interchanges at times), we can compute polynomial hashes of all substrings of length k in O(L) (see Wikipedia for some elaboration on this subproblem).
Now, if we map the hash values to the number of times they occur, we get the value which occurs most frequently in O(L) (with a HashMap with high probability, or in O(L log L) with a TreeMap).
After that, just take the substring which got the most frequent hash as the answer.
This solution does not take hash collisions into account.
The idea is to just reduce the probability of collisions enough for the application (if it's too high, use multiple hashes, for example).
If the application demands that we absolutely never give a wrong answer, we can check the answer in O(L) with another algorithm (KMP, for example), and re-run the whole solution with a different hash function as long as the answer turns out to be wrong.

Related

Longest common prefix - comparing time complexity of two algorithms

If you comparing these two solutions the time complexity of the first solution is O(array-len*sortest-string-len) that you may shorten it to O(n*m) or even O(n^2). And the second one seems O(n * log n) as it has a sort method and then comparing the first and the last item so it would be O(n) and don't have any effect on the O.
But, what happens to the comparing the strings item in the list. Sorting a list of integer values is O(n * log n) but don't we need to compare the characters in the strings to be able to sort them? So, am I wrong if I say the time complexity of the second solution is O(n * log n * longest-string-len)?
Also, as it does not check the prefixes while it is sorting it would do the sorting (the majority of the times) anyway so its best case is far worse than the other option? Also, for the worst-case scenario if you consider the point I mentioned it would still be worse than the first solution?
public string longestCommonPrefix(List<string> input) {
if(input.Count == 0) return "";
if(input.Count == 1) return input[0];
var sb = new System.Text.StringBuilder();
for(var charIndex = 0; charIndex < input[0].Length; charIndex++)
{
for(var itemIndex = 1; itemIndex < input.Count; itemIndex++)
{
if(input[itemIndex].Length > charIndex)
return sb.ToString();
if(input[0][charIndex] != input[itemIndex][charIndex])
return sb.ToString();
}
sb.Append(input[0][charIndex]);
}
return sb.ToString();
}
static string longestCommonPrefix(String[] a)
{
int size = a.Length;
/* if size is 0, return empty string */
if (size == 0)
return "";
if (size == 1)
return a[0];
/* sort the array of strings */
Array.Sort(a);
/* find the minimum length from first
and last string */
int end = Math.Min(a[0].Length,
a[size-1].Length);
/* find the common prefix between the
first and last string */
int i = 0;
while (i < end && a[0][i] == a[size-1][i] )
i++;
string pre = a[0].Substring(0, i);
return pre;
}

First of all, unless I am missing something obvious, the first method runs in O(N * shortest-string-length); shortest, not longest.
Second, you may not reduce O(n*m) to O(n^2): the number of strings and their length are unrelated.
Finally, you are absolutely right. Sorting indeed takes O(n*log(n)*m), so in no case it would improve the performance.
As a side note, it may be beneficial to find the shortest string beforehand. This would make a input[itemIndex].Length > charIndex unnecessary.

Leetcode--3 find the longest substring without repeating character

The target is simple--find the longest substring without repeating characters,here is the code:
class Solution {
public:
int lengthOfLongestSubstring(string s) {
int ans = 0;
int dic[256];
memset(dic, -1, sizeof(dic));
int len = s.size();
int idx = -1;
for (int i = 0;i < len;i++) {
char c = s[i];
if (dic[c] > idx)
idx = dic[c];
ans = max(ans, i - idx);
dic[c] = i;
}
return ans;
}
};
From its concise expression,I think this is a high-performance method,and we can get that its Time Complexity is just O(n).But I'm confused about this method,though I came up with some examples to understand,can anyone give some tips or idea to me?

What it is doing is recording the position where each character was last seen.
As you step through, it takes each new encountered character and the length of non-repeat goes back at least as far as that last-seen, but for future indices can't go back further, as we have now seen a duplicate.
So we are maintaining in idx, the start index of the latest-seen highest-starting duplicate, which is the candidate for the start of the longest non-duplicating sequence.
I'm certain that the ans = max() code code be optimised slightly, as after encountering a new duplicate, you have to go forward at least ans chars from the start of that duplicate before ans can be improved again. You still need to do the rest of the work maintaining dic and idx, but you could avoid that particular test for ans for a few iterations. You would have to do a lot of unrolling to benefit, though.

O(n) time complexity and O(1) space complexity way to see if two strings are permutations of each other

Is there an algorithm that can see if two strings are permutations of each other with O(n) time complexity and O(1) space complexity?

Yes sure there is a very nice way. You have to use count sort for this. There is no reason to generate prime numbers at all. Here is a C code snippet that describes the algorithm:
bool is_permutation(string s1, string s2) {
if(s1.length() != s2.length()) return false;
int count[256]; //assuming each character fits in one byte, also the authors sample solution seems to have this boundary
for(int i=0;i<256;i++) count[i]=0;
for(int i=0;i<s1.length();i++) { //count the digits to see if each digits occur same number of times in both strings
count[ s1[i] ]++;
count[ s2[i] ]--;
}
for(int i=0;i<256;i++) { //see if there is any digit that appeared in different frequency
if(count[i]!=0) return false;
}
return true;
}
EDIT: (I decided to add this after some comments related to order of my program)
The Lets try to calculate the time complexity of the algorithm I have used in my program:
n = max len of strings
m = max allowed different characters, assuming will having all consecutive ascii value in range [0,m-1]
Time complexity: O(max(n,m))
Memory Complexity O(m)
Now assuming m is a constant here the order becomes
Time complexity: O(n)
Memory Complexity O(1)

Here is a simple program I wrote in java that gives the answer in O(n) for time complexity and O(1) for space complexity. It works by mapping every character to a prime number and then multiplying together all of the characters in the string's prime mappings. If the two strings are permutations then they should have the same unique characters each with the same number of occurrences.
Here is some sample code that accomplishes this:
// maps keys to a corresponding unique prime
static Map<Integer, Integer> primes = generatePrimes(255); // use 255 for
// ASCII or the
// number of
// possible
// characters
public static boolean permutations(String s1, String s2) {
// both strings must be same length
if (s1.length() != s2.length())
return false;
// the corresponding primes for every char in both strings are multiplied together
int s1Product = 1;
int s2Product = 1;
for (char c : s1.toCharArray())
s1Product *= primes.get((int) c);
for (char c : s2.toCharArray())
s2Product *= primes.get((int) c);
return s1Product == s2Product;
}
private static Map<Integer, Integer> generatePrimes(int n) {
Map<Integer, Integer> primes = new HashMap<Integer, Integer>();
primes.put(0, 2);
for (int i = 2; primes.size() < n; i++) {
boolean divisible = false;
for (int v : primes.values()) {
if (i % v == 0) {
divisible = true;
break;
}
}
if (!divisible) {
primes.put(primes.size(), i);
System.out.println(i + " ");
}
}
return primes;
}

Google Interview : Find Crazy Distance Between Strings

This Question was asked to me at the Google interview. I could do it O(n*n) ... Can I do it in better time.
A string can be formed only by 1 and 0.
Definition:
X & Y are strings formed by 0 or 1
D(X,Y) = Remove the things common at the start from both X & Y. Then add the remaining lengths from both the strings.
For e.g.
D(1111, 1000) = Only First alphabet is common. So the remaining string is 111 & 000. Therefore the result length("111") & length("000") = 3 + 3 = 6
D(101, 1100) = Only First two alphabets are common. So the remaining string is 01 & 100. Therefore the result length("01") & length("100") = 2 + 3 = 5
It is pretty that obvious that do find out such a crazy distance is going to be linear. O(m).
Now the question is
given n input, say like
1111
1000
101
1100
Find out the maximum crazy distance possible.
n is the number of input strings.
m is the max length of any input string.
The solution of O(n2 * m) is pretty simple. Can it be done in a better way?
Let's assume that m is fixed. Can we do this in better than O(n^2) ?

Put the strings into a tree, where 0 means go left and 1 means go right. So for example
1111
1000
101
1100
would result in a tree like
Root
1
0 1
0 1* 0 1
0* 0* 1*
where the * means that an element ends there. Constructing this tree clearly takes O(n m).
Now we have to find the diameter of the tree (the longest path between two nodes, which is the same thing as the "crazy distance"). The optimized algorithm presented there hits each node in the tree once. There are at most min(n m, 2^m) such nodes.
So if n m < 2^m, then the the algorithm is O(n m).
If n m > 2^m (and we necessarily have repeated inputs), then the algorithm is still O(n m) from the first step.
This also works for strings with a general alphabet; for an alphabet with k letters build a k-ary tree, in which case the runtime is still O(n m) by the same reasoning, though it takes k times as much memory.

I think this is possible in O(nm) time by creating a binary tree where each bit in a string encodes the path (0 left, 1 right). Then finding the maximum distance between nodes of the tree which can be done in O(n) time.

This is my solution, I think it works:
Create a binary tree from all strings. The tree will be constructed in this way:
at every round, select a string and add it to the tree. so for your example, the tree will be:
<root>
<1> <empty>
<1> <0>
<1> <0> <1> <0>
<1> <0> <0>
So each path from root to a leaf will represent a string.
Now the distance between each two leaves is the distance between two strings. To find the crazy distance, you must find the diameter of this graph, that you can do it easily by dfs or bfs.
The total complexity of this algorithm is:
O(n*m) + O(n*m) = O(n*m).

I think this problem is something like "find prefix for two strings", you can use trie(http://en.wikipedia.org/wiki/Trie) to accerlate searching
I have a google phone interview 3 days before, but maybe I failed...
Best luck to you

To get an answer in O(nm) just iterate across the characters of all string (this is an O(n) operation). We will compare at most m characters, so this will be done O(m). This gives a total of O(nm). Here's a C++ solution:
int max_distance(char** strings, int numstrings, int &distance) {
distance = 0;
// loop O(n) for initialization
for (int i=0; i<numstrings; i++)
distance += strlen(strings[i]);
int max_prefix = 0;
bool done = false;
// loop max O(m)
while (!done) {
int c = -1;
// loop O(n)
for (int i=0; i<numstrings; i++) {
if (strings[i][max_prefix] == 0) {
done = true; // it is enough to reach the end of one string to be done
break;
}
int new_element = strings[i][max_prefix] - '0';
if (-1 == c)
c = new_element;
else {
if (c != new_element) {
done = true; // mismatch
break;
}
}
}
if (!done) {
max_prefix++;
distance -= numstrings;
}
}
return max_prefix;
}
void test_misc() {
char* strings[] = {
"10100",
"10101110",
"101011",
"101"
};
std::cout << std::endl;
int distance = 0;
std::cout << "max_prefix = " << max_distance(strings, sizeof(strings)/sizeof(strings[0]), distance) << std::endl;
}

Not sure why use trees when iteration gives you the same big O computational complexity without the code complexity. anyway here is my version of it in javascript O(mn)
var len = process.argv.length -2; // in node first 2 arguments are node and program file
var input = process.argv.splice(2);
var current;
var currentCount = 0;
var currentCharLoc = 0;
var totalCount = 0;
var totalComplete = 0;
var same = true;
while ( totalComplete < len ) {
current = null;
currentCount = 0;
for ( var loc = 0 ; loc < len ; loc++) {
if ( input[loc].length === currentCharLoc) {
totalComplete++;
same = false;
} else if (input[loc].length > currentCharLoc) {
currentCount++;
if (same) {
if ( current === null ) {
current = input[loc][currentCharLoc];
} else {
if (current !== input[loc][currentCharLoc]) {
same = false;
}
}
}
}
}
if (!same) {
totalCount += currentCount;
}
currentCharLoc++;
}
console.log(totalCount);

Is there a circular hash function?

Thinking about this question on testing string rotation, I wondered: Is there was such thing as a circular/cyclic hash function? E.g.
h(abcdef) = h(bcdefa) = h(cdefab) etc
Uses for this include scalable algorithms which can check n strings against each other to see where some are rotations of others.
I suppose the essence of the hash is to extract information which is order-specific but not position-specific. Maybe something that finds a deterministic 'first position', rotates to it and hashes the result?
It all seems plausible, but slightly beyond my grasp at the moment; it must be out there already...

I'd go along with your deterministic "first position" - find the "least" character; if it appears twice, use the next character as the tie breaker (etc). You can then rotate to a "canonical" position, and hash that in a normal way. If the tie breakers run for the entire course of the string, then you've got a string which is a rotation of itself (if you see what I mean) and it doesn't matter which you pick to be "first".
So:
"abcdef" => hash("abcdef")
"defabc" => hash("abcdef")
"abaac" => hash("aacab") (tie-break between aa, ac and ab)
"cabcab" => hash("abcabc") (it doesn't matter which "a" comes first!)

Update: As Jon pointed out, the first approach doesn't handle strings with repetition very well. Problems arise as duplicate pairs of letters are encountered and the resulting XOR is 0. Here is a modification that I believe fixes the the original algorithm. It uses Euclid-Fermat sequences to generate pairwise coprime integers for each additional occurrence of a character in the string. The result is that the XOR for duplicate pairs is non-zero.
I've also cleaned up the algorithm slightly. Note that the array containing the EF sequences only supports characters in the range 0x00 to 0xFF. This was just a cheap way to demonstrate the algorithm. Also, the algorithm still has runtime O(n) where n is the length of the string.
static int Hash(string s)
{
int H = 0;
if (s.Length > 0)
{
//any arbitrary coprime numbers
int a = s.Length, b = s.Length + 1;
//an array of Euclid-Fermat sequences to generate additional coprimes for each duplicate character occurrence
int[] c = new int[0xFF];
for (int i = 1; i < c.Length; i++)
{
c[i] = i + 1;
}
Func<char, int> NextCoprime = (x) => c[x] = (c[x] - x) * c[x] + x;
Func<char, char, int> NextPair = (x, y) => a * NextCoprime(x) * x.GetHashCode() + b * y.GetHashCode();
//for i=0 we need to wrap around to the last character
H = NextPair(s[s.Length - 1], s[0]);
//for i=1...n we use the previous character
for (int i = 1; i < s.Length; i++)
{
H ^= NextPair(s[i - 1], s[i]);
}
}
return H;
}
static void Main(string[] args)
{
Console.WriteLine("{0:X8}", Hash("abcdef"));
Console.WriteLine("{0:X8}", Hash("bcdefa"));
Console.WriteLine("{0:X8}", Hash("cdefab"));
Console.WriteLine("{0:X8}", Hash("cdfeab"));
Console.WriteLine("{0:X8}", Hash("a0a0"));
Console.WriteLine("{0:X8}", Hash("1010"));
Console.WriteLine("{0:X8}", Hash("0abc0def0ghi"));
Console.WriteLine("{0:X8}", Hash("0def0abc0ghi"));
}
The output is now:
7F7D7F7F
7F7D7F7F
7F7D7F7F
7F417F4F
C796C7F0
E090E0F0
A909BB71
A959BB71
First Version (which isn't complete): Use XOR which is commutative (order doesn't matter) and another little trick involving coprimes to combine ordered hashes of pairs of letters in the string. Here is an example in C#:
static int Hash(char[] s)
{
//any arbitrary coprime numbers
const int a = 7, b = 13;
int H = 0;
if (s.Length > 0)
{
//for i=0 we need to wrap around to the last character
H ^= (a * s[s.Length - 1].GetHashCode()) + (b * s[0].GetHashCode());
//for i=1...n we use the previous character
for (int i = 1; i < s.Length; i++)
{
H ^= (a * s[i - 1].GetHashCode()) + (b * s[i].GetHashCode());
}
}
return H;
}
static void Main(string[] args)
{
Console.WriteLine(Hash("abcdef".ToCharArray()));
Console.WriteLine(Hash("bcdefa".ToCharArray()));
Console.WriteLine(Hash("cdefab".ToCharArray()));
Console.WriteLine(Hash("cdfeab".ToCharArray()));
}
The output is:
4587590
4587590
4587590
7077996

You could find a deterministic first position by always starting at the position with the "lowest" (in terms of alphabetical ordering) substring. So in your case, you'd always start at "a". If there were multiple "a"s, you'd have to take two characters into account etc.

I am sure that you could find a function that can generate the same hash regardless of character position in the input, however, how will you ensure that h(abc) != h(efg) for every conceivable input? (Collisions will occur for all hash algorithms, so I mean, how do you minimize this risk.)
You'd need some additional checks even after generating the hash to ensure that the strings contain the same characters.

Here's an implementation using Linq
public string ToCanonicalOrder(string input)
{
char first = input.OrderBy(x => x).First();
string doubledForRotation = input + input;
string canonicalOrder
= (-1)
.GenerateFrom(x => doubledForRotation.IndexOf(first, x + 1))
.Skip(1) // the -1
.TakeWhile(x => x < input.Length)
.Select(x => doubledForRotation.Substring(x, input.Length))
.OrderBy(x => x)
.First();
return canonicalOrder;
}
assuming generic generator extension method:
public static class TExtensions
{
public static IEnumerable<T> GenerateFrom<T>(this T initial, Func<T, T> next)
{
var current = initial;
while (true)
{
yield return current;
current = next(current);
}
}
}
sample usage:
var sequences = new[]
{
"abcdef", "bcdefa", "cdefab",
"defabc", "efabcd", "fabcde",
"abaac", "cabcab"
};
foreach (string sequence in sequences)
{
Console.WriteLine(ToCanonicalOrder(sequence));
}
output:
abcdef
abcdef
abcdef
abcdef
abcdef
abcdef
aacab
abcabc
then call .GetHashCode() on the result if necessary.
sample usage if ToCanonicalOrder() is converted to an extension method:
sequence.ToCanonicalOrder().GetHashCode();

One possibility is to combine the hash functions of all circular shifts of your input into one meta-hash which does not depend on the order of the inputs.
More formally, consider
for(int i=0; i<string.length; i++) {
result^=string.rotatedBy(i).hashCode();
}
Where you could replace the ^= with any other commutative operation.
More examply, consider the input
"abcd"
to get the hash we take
hash("abcd") ^ hash("dabc") ^ hash("cdab") ^ hash("bcda").
As we can see, taking the hash of any of these permutations will only change the order that you are evaluating the XOR, which won't change its value.

I did something like this for a project in college. There were 2 approaches I used to try to optimize a Travelling-Salesman problem. I think if the elements are NOT guaranteed to be unique, the second solution would take a bit more checking, but the first one should work.
If you can represent the string as a matrix of associations so abcdef would look like
a b c d e f
a x
b x
c x
d x
e x
f x
But so would any combination of those associations. It would be trivial to compare those matrices.
Another quicker trick would be to rotate the string so that the "first" letter is first. Then if you have the same starting point, the same strings will be identical.
Here is some Ruby code:
def normalize_string(string)
myarray = string.split(//) # split into an array
index = myarray.index(myarray.min) # find the index of the minimum element
index.times do
myarray.push(myarray.shift) # move stuff from the front to the back
end
return myarray.join
end
p normalize_string('abcdef').eql?normalize_string('defabc') # should return true

Maybe use a rolling hash for each offset (RabinKarp like) and return the minimum hash value? There could be collisions though.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

Maximum repeating substring of size n - string

Related

Longest common prefix - comparing time complexity of two algorithms

Leetcode--3 find the longest substring without repeating character

O(n) time complexity and O(1) space complexity way to see if two strings are permutations of each other

Google Interview : Find Crazy Distance Between Strings

Is there a circular hash function?

Categories

Resources