Longest common substring in two sequences of strings

Having just learned the longest common substring algorithm, I was curious about a particular variant of the problem. It is described as follows:
Given two non-empty sequences of strings, X = (x1, x2, x3, ..., xn) and Y = (y1, y2, y3, ..., ym), where each x(i) and y(i) is a string of characters, find the longest string in X which is a substring of all the strings of Y.
I have a function substring(x, y) which returns a boolean indicating whether x is a substring of y. Evidently, I have to concatenate all the strings in Y to form one big string, say B. I thought of the following approaches:
Naive: Start by concatenating all strings in X to form a string A(n). Apply substring(A(n), B); this includes iterating backward in string A(n). If it returns true, the algorithm ends here and returns A(n), or whatever portion of it is included in said substring. If not, proceed to apply substring(A(n - 1), B), and so on. If no such string exists in X, I return the empty string.
Obviously this approach would take quite some running time depending on the implementation. Assuming I use an iterative approach, on each iteration I would have to iterate backward through the string at that level/index and then apply substring(). It would take at least two loops, and O(size(B) * maxlength(x1, x2, ...)) worst-case time, or more depending on substring() (correct me if I'm wrong).
I thought of a second approach based on suffix trees/arrays.
Generalized Suffix Tree: I build a GST of sequence Y using Ukkonen's algorithm in O(maxlength(y1, y2, ...)) (?). My lack of knowledge of suffix trees bites. I believe a suffix tree approach would substantially reduce running time (at the cost of space) for finding the substring, but I have no idea how to implement the operation.
If there is a better approach, I'd love to know.
EDIT: Apologies if it seemed like I abandoned the topic.
What if I were to use not a GST, but some standard data structure such as a stack, queue, set, heap, priority queue, etc.? The sequence X would have to be sorted, largest string first, naturally. If I store it in a string array, I will have to use a sorting algorithm such as mergesort/quicksort. The goal is to get the running time as efficient as possible.
Can I not store X in a structure that automatically sorts its elements as it builds itself? What about a max-heap?
It would seem like the suffix tree is the best way to find substrings in this fashion. Is there any other data structure I could use?

First, sort the array X from the longest string to the shortest. This way, the first string in X that is a substring of all the strings in Y is the solution.
A multiprocessor algorithm would be the best way to quickly test each string of X against all the strings of Y.
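For illustration, a minimal single-threaded sketch of this idea in Python (a multiprocessor version would just spread the per-string checks across workers); substring(x, y) is assumed to behave like Python's x in y:

    def longest_in_x_common_to_all_y(X, Y):
        # Sort X from longest to shortest so the first hit is the answer.
        for x in sorted(X, key=len, reverse=True):
            # x must be a substring of every string in Y.
            if all(x in y for y in Y):
                return x
        return ""  # no string of X is a substring of all strings of Y

    X = ["abcd", "bc", "zz"]
    Y = ["xxbcx", "abcde", "bcbc"]
    print(longest_in_x_common_to_all_y(X, Y))  # -> "bc"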

Here is my idea about a solution to your problem; I am not sure about everything, so comments are welcome to improve it if you think it is worth the effort.
Begin by computing all common substrings of all strings in Y. First take two strings and build a tree of all their common substrings. Then, for each remaining string in Y, remove from the tree every substring that does not appear in that string. The complexity is linear in the number of strings in Y, but I can't figure out how many elements might end up in the tree, so I cannot estimate the final complexity.
Then find the longest string in X which is a substring of one in the tree.
There must be some improvements to be made to keep the tree as small as possible, such as keeping only substrings that are not substrings of others.
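A rough sketch of this idea in Python, using a plain set where a tree is suggested above (note that a string of length L has O(L^2) distinct substrings, so this naive version can use a lot of memory):

    def all_substrings(s):
        # Every non-empty substring of s.
        return {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}

    def longest_common_via_substring_set(X, Y):
        # Start with the substrings of one string in Y, then intersect with the rest.
        common = all_substrings(Y[0])
        for y in Y[1:]:
            common &= all_substrings(y)
        # A string of X that is a substring of a common substring is itself
        # a common substring, so a direct membership test is enough.
        candidates = [x for x in X if x in common]
        return max(candidates, key=len) if candidates else ""

    print(longest_common_via_substring_set(["abcd", "bc"], ["xxbcx", "abcde", "bcbc"]))  # -> "bc"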

Writing |Y| for the number of strings in the set Y, and len(Y) for their total length:
Process the strings in Y into a generalized suffix tree (for example, using Ukkonen's algorithm). Takes time O(len(Y)), assuming a constant-size alphabet.
Mark each node in the suffix tree according to whether the string identified by that node belongs to all the strings in Y. Takes time O(|Y| len(Y)).
For each string in X, look it up in the suffix tree and see if the node is marked as belonging to all the strings in Y. Output the longest such marked string. Takes time O(len(X)).
Total time: O(|Y| len(Y)) + O(len(X)).

Related

Quickest way to find closest set of points

I have three arrays of points:
A=[[5,2],[1,0],[5,1]]
B=[[3,3],[5,3],[1,1]]
C=[[4,2],[9,0],[0,0]]
I need the most efficient way to find the three points (one for each array) that are closest to each other (within one pixel in each axis).
What I'm doing right now is taking one point as a reference, let's say A[0], and cycling through all the other B and C points looking for a solution. If A[0] gives me no result I'll move the reference to A[1] and so on. This approach has a huge problem: if I increase the number of points in each array and/or the number of arrays, it sometimes requires too much time to converge, especially if the solution is in the last members of the arrays. So I'm wondering if there is any way to do this without using a reference, or any quicker way than just looping over all the elements.
The rules that I must follow are the following:
the final solution has to be made up of exactly one element from each array, like: S=[A[n],B[m],C[j]]
each selected element has to be within 1 pixel in X and Y from ALL the other members of the solution (so |Xi-Xj|<=1 and |Yi-Yj|<=1 for every pair of members of the solution).
For example in this simplified case the solution would be: S=[A[1],B[2],C[1]]
To clarify the problem further: what I wrote above is just a simplified example to explain what I need. In my real case I don't know a priori the length of the lists nor the number of lists I have to work with; it could be A,B,C, or A,B,C,D,E... (each of them with a different number of points), etc. So I also need to find a way to make it as general as possible.
This requirement:
each selected element has to be within 1 pixel in X and Y from ALL the other members of the solution (so |Xi-Xj|<=1 and |Yi-Yj|<=1 for every pair of members of the solution).
massively simplifies the problem, because it means that for any given (xi, yi), there are only nine possible choices of (xj, yj).
So I think the best approach is as follows:
Copy B and C into sets of tuples.
Iterate over A. For each point (xi, yi):
Iterate over the values of x from xi−1 to xi+1 and the values of y from yi−1 to yi+1. For each resulting point (xj, yj):
Check if (xj, yj) is in B. If so:
Iterate over the values of x from max(xi, xj)−1 to min(xi, xj)+1 and the values of y from max(yi, yj)−1 to min(yi, yj)+1. For each resulting point (xk, yk):
Check if (xk, yk) is in C. If so, we're done!
If we get to the end without having a match, that means there isn't one.
This requires roughly O(len(A) + len(B) + len(C)) time and O(len(B) + len(C)) extra space.
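A sketch of these steps in Python, assuming integer pixel coordinates:

    def find_close_triple(A, B, C):
        """Return one point per list, all within 1 pixel of each other in
        both axes, or None if no such triple exists."""
        B_set = {tuple(p) for p in B}
        C_set = {tuple(p) for p in C}
        for xi, yi in A:
            # Only the 9 cells around (xi, yi) can hold a compatible point of B.
            for xj in range(xi - 1, xi + 2):
                for yj in range(yi - 1, yi + 2):
                    if (xj, yj) not in B_set:
                        continue
                    # A point of C must be within 1 of both (xi, yi) and (xj, yj).
                    for xk in range(max(xi, xj) - 1, min(xi, xj) + 2):
                        for yk in range(max(yi, yj) - 1, min(yi, yj) + 2):
                            if (xk, yk) in C_set:
                                return (xi, yi), (xj, yj), (xk, yk)
        return None

    A = [[5, 2], [1, 0], [5, 1]]
    B = [[3, 3], [5, 3], [1, 1]]
    C = [[4, 2], [9, 0], [0, 0]]
    print(find_close_triple(A, B, C))  # -> ((5, 2), (5, 3), (4, 2))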
Edited to add (due to a follow-up question in the comments): if you have N lists instead of just 3, then instead of nesting N loops deep (which gives time exponential in N), you'll want to do something more like this:
Copy B, C, etc., into sets of tuples, as above.
Iterate over A. For each point (xi, yi):
Create a set containing (xi, yi) and its eight neighbors.
For each of the lists B, C, etc.:
For each element in the set of nine points, see if it's in the current list.
Update the set to remove any points that aren't in the current list and don't have any neighbors in the current list.
If the set still has at least one element, then — great, each list contained a point that's within one pixel of that element (with all of those points also being within one pixel of each other). So, we're done!
If we get to the end without having a match, that means there isn't one.
which is much more complicated to implement, but is linear in N instead of exponential in N.
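The bookkeeping in the update step is a bit fiddly, so here is a sketch of a slightly different but also linear-in-N formulation: any valid selection fits inside some 2x2 window of pixels that contains the chosen point of the first list, so it is enough to try the four such windows around each point of A. This is not the exact procedure described above, but the same idea with the same complexity.

    def find_close_selection(lists):
        """Pick one point per list, all pairwise within 1 pixel in x and y.
        Returns a list of points, or None."""
        first, rest = lists[0], [{tuple(p) for p in lst} for lst in lists[1:]]
        for x, y in first:
            # The selection's bounding box has its lower-left corner at one of
            # these four positions if it is to contain (x, y).
            for x0 in (x - 1, x):
                for y0 in (y - 1, y):
                    window = [(x0, y0), (x0 + 1, y0), (x0, y0 + 1), (x0 + 1, y0 + 1)]
                    picks = [(x, y)]
                    for s in rest:
                        hit = next((p for p in window if p in s), None)
                        if hit is None:
                            break
                        picks.append(hit)
                    else:
                        return picks  # every list contributed a point in the window
        return None

    A = [[5, 2], [1, 0], [5, 1]]
    B = [[3, 3], [5, 3], [1, 1]]
    C = [[4, 2], [9, 0], [0, 0]]
    print(find_close_selection([A, B, C]))  # -> [(5, 2), (5, 3), (4, 2)]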
Currently, you are finding the solution with a brute-force algorithm which has O(n^2) complexity. If your lists contain 1000 items, your algorithm will need 1000000 iterations to run... (It's even O(n^3), as tobias_k pointed out.)
As you can see here: https://en.wikipedia.org/wiki/Closest_pair_of_points_problem, you could improve it by using a divide-and-conquer algorithm, which would run in O(n log n) time.
You should search for Delaunay triangulation and/or Voronoi diagram implementations.
NB: if you can use external libs, you should also consider taking a look at the scipy lib: https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.spatial.Delaunay.html

Detect periodical string

I am trying to solve this problem, but I couldn't get to linear time.
A string T is called periodical if it can be represented in the form T = PPP...P.
Design a linear-time algorithm for deciding whether a given T is periodical, and if it is, find the shortest period.
My approach:
If T = AB = BA then T is periodical; my algorithm keeps checking whether the string can be represented like that, and if yes, I check half of it.
It takes O(n*log(n)) time.
Thanks guys
The KMP search algorithm computes, in part, the longest substring that is both a prefix and a suffix of the string (shorter than the entire string).
If you apply it to a periodical string you'll get
len(string) - len(substring) = period
For the string to be periodical, len(substring) must be >= len(string) / 2 and the period must divide len(string) evenly; otherwise there's no period.
The period found will also be the shortest period.
KMP is linear.
So check it out (wikipedia).
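A short Python sketch of that check using the KMP failure (prefix) function:

    def shortest_period(t):
        """Return the shortest P with t == P * k for some k >= 2, or None."""
        n = len(t)
        fail = [0] * n  # fail[i] = length of the longest proper border of t[:i+1]
        k = 0
        for i in range(1, n):
            while k > 0 and t[i] != t[k]:
                k = fail[k - 1]
            if t[i] == t[k]:
                k += 1
            fail[i] = k
        border = fail[-1] if n else 0
        period = n - border
        # t is periodical iff it has a non-empty border and the period divides n.
        if border > 0 and n % period == 0:
            return t[:period]
        return None

    print(shortest_period("abcabcabc"))  # -> "abc"
    print(shortest_period("ababa"))      # -> None (not of the form PPP...P)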

Generate test cases for levenshtein distance implementation with quickCheck

As part of me learning about QuickCheck I want to build a test generator for a Levenshtein edit distance implementation. The obvious approach, I think, is to start with two equal strings and a random non-reducible series of insert/delete/transpose actions, apply that to one of the strings, and assert that the Levenshtein distance is the length of the random series.
I am quite stuck with this; can someone help?
Getting "non-reducible" right sounds pretty hard. I would try to find a larger number of less complicated invariants. Here are some ideas:
The edit distance between any string and itself is 0.
No two strings have a negative edit distance.
For an arbitrary string x, if you apply exactly one change to it, producing y, the edit distance between x and y should be 1.
Given two strings x and y, compute the distance d between them. Then, change y, yielding y', and compute its distance from x: it should differ from d by at most 1.
After applying n edits to a string x, the distance between the edited string and x should be at most n. Note that case (1) is a special case of this, where n=0, so you could omit that one for concision if you like. Or, keep it around, since case (1) may generate simpler counterexamples.
The function should be symmetric: the edit distance from x to y should be the same as from y to x.
If you have another, known-good implementation of the algorithm to test against, you could compare to that, and assert that you always get the same answer as it does.
The above were all just things that appealed to me without any research. You could do more: for example, encode the lower and upper bounds as defined by Wikipedia.
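For what it's worth, here is how a few of these properties could look in Python using Hypothesis (a QuickCheck-style library), with a throwaway dynamic-programming implementation standing in for the one under test; run with pytest:

    from hypothesis import given, strategies as st

    def levenshtein(a, b):
        # Simple O(len(a) * len(b)) reference implementation; in practice this
        # would be the implementation under test.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # deletion
                               cur[j - 1] + 1,              # insertion
                               prev[j - 1] + (ca != cb)))   # substitution
            prev = cur
        return prev[-1]

    @given(st.text())
    def test_self_distance_is_zero(s):
        assert levenshtein(s, s) == 0

    @given(st.text(), st.text())
    def test_symmetry(x, y):
        assert levenshtein(x, y) == levenshtein(y, x)

    @given(st.text(), st.text(), st.characters())
    def test_one_edit_changes_distance_by_at_most_one(x, y, c):
        d = levenshtein(x, y)
        assert abs(levenshtein(x, y + c) - d) <= 1  # appending c is a single edit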

An efficient algorithm for finding smallest pangrammatic windows?

A pangrammatic window is a substring of a larger piece of text that contains all 26 letters of the alphabet. To quote an example from Wikipedia, given this text:
I sang, and thought I sang very well; but he just looked up into my face with a very quizzical expression, and said, 'How long have you been singing, Mademoiselle?'
The smallest pangrammatic window in the text is this string:
g very well; but he just looked up into my face with a very quizzical ex
Which indeed contains every letter at least once.
My question is this: Given a text corpus, what is the most efficient algorithm for finding the smallest pangrammatic window in the text?
I've given this some thought and come up with the following algorithms already. I have a strong feeling that these are not optimal, but I thought I'd post them as a starting point.
There is a simple naive algorithm that runs in time O(n^2) and space O(1): For each position in the string, scan forward from that position and track what letters you've seen (perhaps in a bit vector, which, since there are only 26 different letters, takes space O(1)). Once you've found all 26 letters, you have the length of the shortest pangrammatic window starting at that given point. Each scan might take time O(n), and there are O(n) scans, for a grand total of O(n^2) time.
We can also solve this problem in time O(n log n) and space O(n) using a modified binary search. Construct 26 arrays, one for each letter of the alphabet, then populate those arrays with the positions of each letter in the input text in sorted order. We can do this by simply scanning across the text, appending each index to the array corresponding to the current character. Once we have this, we can find, in time O(log n), the length of the shortest pangrammatic window beginning at some index by running 26 binary searches in the arrays to find the earliest time that each character appears in the input array at or after the given index. Whichever of these numbers is greatest gives the "long pole" character that appears furthest down in the string, and thus gives the endpoint of the pangrammatic window. Running this search step takes O(log n) time, and since we have to do it for all n characters in the string, the total runtime is O(n log n), with O(n) memory usage for the arrays.
A further refinement for the above approach is to replace the arrays and binary search with van Emde Boas trees and predecessor searches. This increases the creation time to O(n log log n), but reduces each search time to O(log log n) time, for a net runtime of O(n log log n) with O(n) space usage.
Are there any better algorithms out there?
For every letter, keep track of its most recent sighting. Whenever you process a letter, update the corresponding sighting index and calculate the range (max - min) of sighting indexes over all letters. Find the location with minimum range.
Complexity O(n), or O(n log(m)) if you consider the alphabet size m.
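A Python sketch of this idea:

    import string

    def smallest_pangrammatic_window(text):
        """Return (start, end) indexes of the smallest window containing all
        26 letters, or None."""
        last_seen = {}          # letter -> index of its most recent occurrence
        best = None
        for i, ch in enumerate(text.lower()):
            if ch not in string.ascii_lowercase:
                continue
            last_seen[ch] = i
            if len(last_seen) == 26:
                # The window from the oldest sighting to here covers every letter.
                start = min(last_seen.values())
                if best is None or i - start < best[1] - best[0]:
                    best = (start, i)
        return best

    text = ("I sang, and thought I sang very well; but he just looked up into "
            "my face with a very quizzical expression, and said, 'How long "
            "have you been singing, Mademoiselle?'")
    w = smallest_pangrammatic_window(text)
    print(text[w[0]:w[1] + 1] if w else None)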
This algorithm has O(M) space complexity and O(N) time complexity (time does not depend on alphabet size M):
1. Advance the first iterator and increase a counter for each processed letter. Stop when all 26 counters are non-zero.
2. Advance the second iterator and decrease the counter for each processed letter. Stop when any of these counters becomes zero.
3. Use the difference between the iterators to update the best-so-far result and continue with step 1.
This algorithm may be improved a little bit if instead of character counters, positions in the string are stored. In this case step 2 should only read these positions and compare with current position, and step 1 should update these positions and (most of the time) search for some character in the text.
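A Python sketch of the two-iterator version, with letter counters as in steps 1-3 above:

    from collections import Counter
    import string

    LETTERS = set(string.ascii_lowercase)

    def smallest_pangrammatic_window_two_pointers(text):
        """Expand the right end until all 26 letters are covered, then shrink
        from the left while they still are; return (start, end) or None."""
        s = text.lower()
        counts = Counter()
        covered = 0          # how many distinct letters currently have a non-zero count
        best = None
        left = 0
        for right, ch in enumerate(s):
            if ch in LETTERS:
                counts[ch] += 1
                if counts[ch] == 1:
                    covered += 1
            while covered == 26:
                if best is None or right - left < best[1] - best[0]:
                    best = (left, right)
                c = s[left]
                left += 1
                if c in LETTERS:
                    counts[c] -= 1
                    if counts[c] == 0:
                        covered -= 1
        return best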

String sorting using Merge Sort

What will be the worst-case complexity for sorting n strings having n characters each? Will it be just n times its average case O(n log n), or something else...?
When you are talking about O notation with two things with different lengths, typically you want to use different variables, like M and N.
So, if your merge sort is O(N log N), where N is the number of strings... and comparing two strings is O(M) where M scales with the length of the string, then you'll be left with:
O(N log N) * O(M)
or
O(M N log N)
where M is the string length and N is the number of strings. You want to use different labels because they don't mean the same thing.
In the strange case where the average string length scales with the number of strings, like if you had a matrix stored in strings or something like that, you could argue that M = N, and then you'd have O(N^2 log N)
As @orangeoctopus said, using a standard comparison-based sorting algorithm on a collection of n strings of size n will result in O(n^2 * log n) computation.
However, note that you can do it in O(n^2), with variations on radix sort.
The simplest way to do it [in my opinion] is:
Build a trie and populate it with all your strings. Entering each string is O(n) and you do it n times, for a total of O(n^2).
Do a DFS on the trie; each time you encounter an end-of-string mark, add the corresponding string to the sorted collection. The strings are added in lexicographic order, so your list will be sorted lexicographically when you are done.
It is easy to see you cannot do any better than O(n^2), since merely reading the data is O(n^2); thus this solution is optimal in terms of big-O time complexity.
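A Python sketch of the trie approach, using a dict-of-dicts trie and sorting each node's children for simplicity rather than a fixed-size alphabet array:

    def trie_sort(strings):
        """Lexicographically sort strings by inserting them into a trie and
        reading them back with a DFS."""
        root = {}               # node: char -> child node; END marks a stored word
        END = object()
        for s in strings:
            node = root
            for ch in s:
                node = node.setdefault(ch, {})
            node[END] = node.get(END, 0) + 1   # count duplicates

        out = []
        def dfs(node, prefix):
            if END in node:
                out.extend([prefix] * node[END])
            for ch in sorted(k for k in node if k is not END):
                dfs(node[ch], prefix + ch)
        dfs(root, "")
        return out

    print(trie_sort(["banana", "band", "apple", "band"]))
    # -> ['apple', 'banana', 'band', 'band']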
Sorting N items with MergeSort requires O(N log N) comparisons. If the time to compare two items is O(1) then the total running time will be O(N log N). However, comparing two strings of length N requires O(N) time, so a naive implementation might be stuck with O(N*N log N) time.
This seems wasteful because we are not taking advantage of the fact that there are only N strings around to make comparisons. We might somehow preprocess the strings so that comparisons take less time on average.
Here is an idea. Create a trie structure and put the N strings there. The trie will have O(N*N) nodes and require O(N*N) time to build. Traverse the trie and assign an integer "ranking" to each node; if R(N1) < R(N2) then the string associated with node N1 comes before the string associated with node N2 in a dictionary.
Now proceed with MergeSort, doing the comparisons in O(1) time by looking up the ranks. The total running time will be O(N*N + N*log N) = O(N*N).
Edit: My answer is very similar to that of @amit. However, I proceed with mergesort where he proceeds with radix sort after the trie-building step.
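A Python sketch of the ranking idea, with the built-in sort standing in for MergeSort since only the O(1) key comparisons matter:

    def sort_with_precomputed_ranks(strings):
        """Assign each distinct string its lexicographic rank via a trie DFS,
        then sort using O(1) integer comparisons."""
        root, END = {}, object()
        for s in strings:
            node = root
            for ch in s:
                node = node.setdefault(ch, {})
            node[END] = True

        rank = {}
        def dfs(node, prefix):
            if END in node:
                rank[prefix] = len(rank)   # ranks handed out in lexicographic order
            for ch in sorted(k for k in node if k is not END):
                dfs(node[ch], prefix + ch)
        dfs(root, "")

        # Any comparison sort can now compare two strings by their integer rank.
        return sorted(strings, key=rank.__getitem__)

    print(sort_with_precomputed_ranks(["banana", "band", "apple"]))  # -> ['apple', 'banana', 'band']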
