Is it possible to compare two substrings of the same length from the same given big string faster than O(n)? (Where n is the length of the substrings.)
I mean, if you have queries like "compare the substring between positions x1 and y1 with the one between positions x2 and y2".
You could compute a suffix array for the big string.
This array tells you how the suffix starting at x1 orders relative to the suffix starting at x2.
You will also need to check that the two suffixes actually diverge before the end of the queried ranges (otherwise the substrings could be equal). You can do this with a rolling hash, or with a longest common prefix (LCP) array.
There is a good tutorial on suffix arrays HERE.
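For the rolling-hash variant, here is a minimal sketch (the names and constants are mine, not from the answer above): precompute prefix hashes in O(n), then answer equality queries in O(1) and lexicographic comparisons in O(log n) by binary-searching for the first mismatch. Being a hash, it is probabilistic: collisions are possible in principle.

    MOD = (1 << 61) - 1    # large prime modulus (arbitrary choice)
    BASE = 131             # hash base (arbitrary choice)

    class SubstringComparer:
        def __init__(self, s):
            n = len(s)
            self.s = s
            self.h = [0] * (n + 1)   # h[i] = hash of s[:i]
            self.p = [1] * (n + 1)   # p[i] = BASE**i mod MOD
            for i, c in enumerate(s):
                self.h[i + 1] = (self.h[i] * BASE + ord(c)) % MOD
                self.p[i + 1] = (self.p[i] * BASE) % MOD

        def _hash(self, i, j):       # hash of s[i:j]
            return (self.h[j] - self.h[i] * self.p[j - i]) % MOD

        def equal(self, x1, x2, length):     # O(1) per query
            return self._hash(x1, x1 + length) == self._hash(x2, x2 + length)

        def compare(self, x1, x2, length):   # O(log n): returns -1, 0, or 1
            lo, hi = 0, length               # binary-search the first mismatch
            while lo < hi:
                mid = (lo + hi) // 2
                if self.equal(x1, x2, mid + 1):
                    lo = mid + 1
                else:
                    hi = mid
            if lo == length:
                return 0
            return -1 if self.s[x1 + lo] < self.s[x2 + lo] else 1

    sc = SubstringComparer("banana")
    print(sc.equal(1, 3, 3))      # True: "ana" == "ana"
    print(sc.compare(0, 1, 3))    # 1: "ban" > "ana"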
In principle, there are faster string searching algorithms than checking every character, but they are all proportional to the amount of data being searched,
e.g. extended BNDM, Boyer-Moore.
There are also multi-string searching algorithms (e.g. Aho-Corasick), but at the end of the day they are all O(n). You can reduce the constant factor, but you still need to scan the whole string.
I encountered this question while studying for an algorithms test:
Given a set of k words (strings) with a total character count of n (meaning the sum of all word lengths is n), perform some manipulation on the words in O(n) time, such that whenever 2 words are compared, the answer (whether they are identical or not) is returned in O(1) time.
It's an interesting question, but I could not find any direction for dealing with it...
Construct a trie of all of the words, and for each word store the trie node (or its index, if the nodes live in an array) reached at its last character. This is an O(n) operation.
Given two words, they are the same if and only if they end at the same node.
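A minimal sketch of that idea, assuming the trie nodes are stored in an array of child maps (the names are mine):

    def build_word_ids(words):
        children = [{}]            # children[node][char] -> child node index
        end_node = []              # end_node[i] = node where words[i] ends
        for w in words:
            node = 0
            for c in w:
                if c not in children[node]:
                    children[node][c] = len(children)
                    children.append({})
                node = children[node][c]
            end_node.append(node)
        return end_node

    ids = build_word_ids(["rose", "is", "a", "rose"])
    print(ids[0] == ids[3])   # True: identical words end at the same node
    print(ids[0] == ids[1])   # False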
Got this question in a recent interview. Basic string compare with a little twist. I have an input string, STR1 = 'ABC'. I should return "Same/Similar" when the string to compare, STR2, has any one of these values - 'ACB' 'BAC' 'ABC' 'BCA' 'CAB' 'CBA' (that is, same characters, same length and same number of occurrences). The only answer that struck me at that moment was to sort both strings with merge sort or quicksort, since their complexity is O(n log n). Is there any better algorithm to achieve the above result?
Sorting both, and comparing the results for equality, is not a bad approach for strings of reasonable lengths.
Another approach is to use a map/dictionary/object (depending on language) from character to number-of-occurrences. You then iterate over the first string, incrementing the counts, and iterate over the second string, decrementing them. Assuming you have first checked that the lengths are equal, you can return false as soon as you get a negative number, and true if you reach the end without one.
And if your set of possible characters is small enough to be considered constant, you can use an array as the "map", resulting in O(n) worst-case complexity.
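A quick sketch of that counting approach (assuming the strings are ASCII, so a 128-slot array can serve as the "map"):

    def is_anagram(s1, s2):
        if len(s1) != len(s2):
            return False
        counts = [0] * 128           # one slot per ASCII character
        for c in s1:
            counts[ord(c)] += 1
        for c in s2:
            counts[ord(c)] -= 1
            if counts[ord(c)] < 0:   # s2 has a character s1 lacks
                return False
        return True

    print(is_anagram("ABC", "CAB"))   # True
    print(is_anagram("ABC", "ABB"))   # False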
Supposing you can use any language, I would opt for a Python dictionary solution. You could use 2 dictionaries keyed by each string's characters, with occurrence counts as values. Then you can compare the dictionaries and return the respective result. Counting the values is what makes this work for strings whose characters appear more than once.
If I want to find the longest common substring of 2 strings, which approach will be more efficient in terms of time/space complexity: suffix arrays or DP?
DP will incur O(m*n) space with O(m*n) time complexity, what will be the time complexity of the suffix array approach?
1) Calculate the suffixes O(m) + O(n)
2) Sort them: O((m+n) log(m+n))
3) Find the longest common prefix for the m+n-1 adjacent pairs of suffixes? [I'm not sure how to count the comparisons]
Suffix arrays let us do many more things with the substrings (like searching for a substring, etc.), but since the rest of that functionality is not needed here, would DP be considered the easier/cleaner approach? Which one should be used when comparing 2 strings?
Also, what if we have more than 2 strings?
A suffix array would be better. The LCS (longest common substring of n strings) problem can be solved as below:
Concatenate S1, S2, ..., Sn as follows:
S = S1 $1 S2 $2 ... $n-1 Sn, where the $i are special symbols (sentinels) that are pairwise distinct and lexicographically less than the other symbols of the initial alphabet.
Compute the suffix array. Suffix arrays are usually implemented in O(n log n), but there is an important algorithm called DC3 which computes suffix arrays in O(n), where n is the total length of the N strings. You can google this algorithm.
Compute the LCP of all adjacent suffixes. For two strings, the LCS is the maximum LCP between adjacent suffixes that come from different input strings; for more strings you need a sliding window over the suffix array that covers suffixes from all of them.
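A compact sketch of the two-string case (the suffix array is built naively here for brevity; a real implementation would use an O(n log n) construction or DC3):

    def longest_common_substring(s1, s2):
        s = s1 + "\x01" + s2      # "\x01" plays the role of the sentinel
        n = len(s)
        sa = sorted(range(n), key=lambda i: s[i:])   # naive suffix array
        def lcp(i, j):            # longest common prefix of two suffixes
            k = 0
            while i + k < n and j + k < n and s[i + k] == s[j + k]:
                k += 1
            return k
        best, best_i = 0, 0
        for a, b in zip(sa, sa[1:]):
            # adjacent suffixes must come from different input strings
            if (a < len(s1)) != (b < len(s1)):
                cur = lcp(a, b)
                if cur > best:
                    best, best_i = cur, a
        return s[best_i:best_i + best]

    print(longest_common_substring("banana", "cabana"))   # "bana"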
Is it still possible to achieve O(n) time complexity when searching for multiple occurrences of a pattern with the Knuth–Morris–Pratt algorithm?
Suppose we have a string S[0,...,N]. Recall that the ith entry of the prefix array stores the length of the longest proper prefix of S[0,...,i] that is also a suffix of S[0,...,i].
We can calculate the prefix array P for pattern$subject (assuming that $ doesn't occur in subject). It remains to find indices such that P[i]==length(pattern), which can be done in linear time.
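A short sketch of that prefix-array trick (the function name is mine; "$" is assumed absent from the subject):

    def find_all(pattern, subject):
        s = pattern + "$" + subject
        p = [0] * len(s)                 # p[i] = longest proper prefix of
        for i in range(1, len(s)):       # s[:i+1] that is also its suffix
            k = p[i - 1]
            while k and s[i] != s[k]:
                k = p[k - 1]
            if s[i] == s[k]:
                k += 1
            p[i] = k
        m = len(pattern)
        # p[i] == m marks an occurrence of pattern ending at index i of s
        return [i - 2 * m for i in range(len(s)) if p[i] == m]

    print(find_all("ana", "banana"))   # [1, 3]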
I am on an interview ride here. One more interview question I had difficulties with.
“A rose is a rose is a rose.” Write an algorithm that prints the number of times a character/word occurs, e.g. A – 3, Rose – 3, Is – 2. Also ensure that when you print the results, they are in the order in which they appeared in the original sentence. All this in order n.
I did get a solution for counting the number of occurrences of each word in the sentence, in the order present in the original sentence; I used a Dictionary<string,int> to do it. However, I did not understand what is meant by "order of n". That is something I need you guys to explain.
There are only 26 characters, so you can count them with counting sort; while counting, record the index at which each character is first visited, to preserve the order of occurrence. [They can then be ordered by their count and first occurrence with a sort like radix sort.]
Edit: for words, the first thing everyone thinks of is a hash table: insert the words into the hash and count them there. The counts can still be sorted in O(n) with counting sort, because they are all within 1..n; for the order of occurrence, you can traverse the string once and record where each word first appears.
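A small sketch of the character half of that answer (assuming lowercase ASCII letters, per the "26 characters" remark):

    def char_counts_in_order(s):
        counts = [0] * 26
        first_seen = [-1] * 26
        for pos, c in enumerate(s):
            i = ord(c) - ord('a')
            counts[i] += 1
            if first_seen[i] == -1:
                first_seen[i] = pos      # remember the first occurrence
        # emit letters ordered by where they first appeared
        for i in sorted((i for i in range(26) if counts[i]),
                        key=lambda i: first_seen[i]):
            print(f"{chr(i + ord('a'))} - {counts[i]}")

    char_counts_in_order("banana")   # b - 1, a - 3, n - 2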
Order of n means you traverse the string only once, or some constant multiple of n times, where n is the number of characters in the string.
So your solution that stores each word and its number of occurrences is O(n), order of n, as you loop through the complete string only once.
However, it uses extra space in the form of the list you created.
Order N refers to Big O computational complexity analysis, which gives you a good upper bound on algorithms. It is a theory we cover early in a Data Structures class, so we can torment, I mean help, the student gain facility with it as we traverse, in a balanced way, heaps of different trees of knowledge, all different. In your case they want your algorithm's running time to grow proportionally to the size of the text as it grows.
It's a reference to Big O notation. Basically the interviewer means that you have to complete the task with an O(N) algorithm.
"Order n" is referring to Big O notation. Big O is a way for mathematicians and computer scientists to describe the behavior of a function. When someone specifies searching a string "in order n", that means that the time it takes for the function to execute grows linearly as the length of that string increases. In other words, if you plotted time of execution vs length of input, you would see a straight line.
Saying that your function must be of order n does not mean that it must be exactly O(n); a function with a Big O less than O(n) would also be considered acceptable. In your problem's case, though, that is not possible (because in order to count a letter, you must "touch" that letter, so there must be some operation dependent on the input size).
One possible method is to traverse the string linearly. Then create a hash and a list. The idea is to use the word as the hash key and increment the value for each occurrence. If the word is not yet in the hash, add it to the end of the list. After traversing the string, go through the list in order, using the hash values as the counts.
The order of the algorithm is O(n). The hash lookup and list append operations are O(1) (or very close to it).
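A sketch of that hash-plus-list approach (case-insensitive and split on whitespace, to match the example sentence):

    def word_counts_in_order(sentence):
        counts = {}          # word -> number of occurrences
        order = []           # words in order of first appearance
        for word in sentence.lower().split():
            if word not in counts:
                counts[word] = 0
                order.append(word)
            counts[word] += 1
        for word in order:
            print(f"{word} - {counts[word]}")

    word_counts_in_order("A rose is a rose is a rose")
    # a - 3
    # rose - 3
    # is - 2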