An efficient algorithm for finding smallest pangrammatic windows?

A pangrammatic window is a substring of a larger piece of text that contains all 26 letters of the alphabet. To quote an example from Wikipedia, given this text:
I sang, and thought I sang very well; but he just looked up into my face with a very quizzical expression, and said, 'How long have you been singing, Mademoiselle?'
The smallest pangrammatic window in the text is this string:
g very well; but he just looked up into my face with a very quizzical ex
Which indeed contains every letter at least once.
My question is this: Given a text corpus, what is the most efficient algorithm for finding the smallest pangrammatic window in the text?
I've given this some thought and come up with the following algorithms already. I have a strong feeling that these are not optimal, but I thought I'd post them as a starting point.
There is a simple naive algorithm that runs in time O(n^2) and space O(1): For each position in the string, scan forward from that position and track what letters you've seen (perhaps in a bit vector, which, since there are only 26 different letters, takes space O(1)). Once you've found all 26 letters, you have the length of the shortest pangrammatic window starting at that given point. Each scan might take time O(n), and there are O(n) scans, for a grand total of O(n^2) time.
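A minimal sketch of this naive scan, in Python. The function name, the `alphabet` parameter, and the choice to lowercase the text first are assumptions made for illustration; a set stands in for the bit vector.

```python
from string import ascii_lowercase

def smallest_window_naive(text, alphabet=ascii_lowercase):
    """Return (start, end) of a shortest window holding every letter, or None.

    Indices refer to the lowercased text and follow slice conventions.
    """
    text = text.lower()
    needed = set(alphabet)
    best = None
    for start in range(len(text)):
        seen = set()  # plays the role of the bit vector
        for end in range(start, len(text)):
            if text[end] in needed:
                seen.add(text[end])
                if len(seen) == len(needed):
                    if best is None or end + 1 - start < best[1] - best[0]:
                        best = (start, end + 1)
                    break
        else:
            # No complete window starts here, so none can start later either.
            break
    return best
```

Passing a small alphabet such as `"abc"` makes the behaviour easy to check by hand.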
We can also solve this problem in time O(n log n) and space O(n) using a modified binary search. Construct 26 arrays, one for each letter of the alphabet, then populate those arrays with the positions of each letter in the input text in sorted order. We can do this by simply scanning across the text, appending each index to the array corresponding to the current character. Once we have this, we can find, in time O(log n), the length of the shortest pangrammatic window beginning at some index by running 26 binary searches in the arrays to find the earliest position at which each character appears in the input text at or after the given index. Whichever of these positions is greatest gives the "long pole" character that appears furthest down in the string, and thus gives the endpoint of the pangrammatic window. Running this search step takes O(log n) time, and since we have to do it for all n characters in the string, the total runtime is O(n log n), with O(n) memory usage for the arrays.
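The position-array idea could look roughly like this in Python, using the standard `bisect` module for the binary searches. The function name and the `alphabet` parameter are assumptions for illustration only.

```python
from bisect import bisect_left
from string import ascii_lowercase

def smallest_window_bisect(text, alphabet=ascii_lowercase):
    """Return (start, end) of a shortest window holding every letter, or None."""
    text = text.lower()
    # One sorted list of positions per letter, filled in a single scan.
    positions = {ch: [] for ch in alphabet}
    for i, ch in enumerate(text):
        if ch in positions:
            positions[ch].append(i)

    best = None
    for start in range(len(text)):
        end = start
        for plist in positions.values():
            k = bisect_left(plist, start)  # first occurrence at or after start
            if k == len(plist):
                return best  # this letter never occurs again: no more windows
            end = max(end, plist[k])  # the "long pole" so far
        if best is None or end + 1 - start < best[1] - best[0]:
            best = (start, end + 1)
    return best
```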
A further refinement for the above approach is to replace the arrays and binary search with van Emde Boas trees and predecessor searches. This increases the creation time to O(n log log n), but reduces each search time to O(log log n) time, for a net runtime of O(n log log n) with O(n) space usage.
Are there any better algorithms out there?

For every letter, keep track of its most recent sighting. Whenever you process a letter, update the corresponding sighting index and, once every letter has been seen at least once, calculate the range (max - min) of the sighting indexes over all letters. The position with the minimum range gives the smallest window.
Complexity is O(n), or O(n log(m)) if you take the alphabet size m into account.
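A minimal Python sketch of this idea follows. The names are illustrative assumptions. For simplicity it recomputes the minimum over all sighting indexes at each step, which costs O(m) per letter rather than the O(log m) an ordered structure would give. The maximum is always the current index, so it needs no search.

```python
from string import ascii_lowercase

def smallest_window_last_seen(text, alphabet=ascii_lowercase):
    """Return (start, end) of a shortest window holding every letter, or None."""
    text = text.lower()
    needed = set(alphabet)
    last = {}  # letter -> index of its most recent sighting
    best = None
    for i, ch in enumerate(text):
        if ch not in needed:
            continue
        last[ch] = i
        if len(last) == len(needed):
            # The oldest "most recent sighting" is where the window must start.
            start = min(last.values())
            if best is None or i + 1 - start < best[1] - best[0]:
                best = (start, i + 1)
    return best
```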

This algorithm has O(M) space complexity and O(N) time complexity (time does not depend on alphabet size M):
1. Advance the first iterator, incrementing the counter for each letter it processes. Stop when all 26 counters are non-zero.
2. Advance the second iterator, decrementing the counter for each letter it processes. Stop when any of these counters reaches zero.
3. Use the difference between the iterators to update the best-so-far result, then continue with step 1.
This algorithm may be improved a little bit if instead of character counters, positions in the string are stored. In this case step 2 should only read these positions and compare with current position, and step 1 should update these positions and (most of the time) search for some character in the text.
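The counter-based version of this two-iterator scheme might be sketched as follows in Python. The function name and the `missing` variable are assumptions added for illustration: `missing` tracks how many letters currently have a zero counter, so the stop conditions can be checked without looking at all counters.

```python
from collections import Counter
from string import ascii_lowercase

def smallest_window_two_pointers(text, alphabet=ascii_lowercase):
    """Return (start, end) of a shortest window holding every letter, or None."""
    text = text.lower()
    needed = set(alphabet)
    counts = Counter()
    missing = len(needed)  # letters whose counter is currently zero
    best = None
    left = 0  # second iterator
    for right, ch in enumerate(text):  # first iterator
        if ch in needed:
            if counts[ch] == 0:
                missing -= 1
            counts[ch] += 1
        # While the window is complete, record it and shrink it from the left.
        while missing == 0:
            if best is None or right + 1 - left < best[1] - best[0]:
                best = (left, right + 1)
            out = text[left]
            if out in needed:
                counts[out] -= 1
                if counts[out] == 0:
                    missing += 1
            left += 1
    return best
```

Each index is visited at most once by each iterator, so the running time is linear in the length of the text.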

Related

Will I have a smaller time complexity if I use both the head and tail pointers in a doubly linked list to search an element?

So if I want to search for an element in a doubly-linked list, will I get a better time complexity, such as O(log N), if I search from both ends (the beginning and the end of the list) at the same time, or will I still get linear time?
You will still get linear time complexity if you're traversing links in a doubly-linked list. Binary search's logarithmic time complexity depends on index-based random access of array elements in a sorted list. Consider a doubly-linked list with n/2 instances of a constant c followed by n/2 instances of 2c. To determine a number b where c < b < 2c is not in such a list you'd definitely have to check n/2 entries regardless of which end you search from. Even having the entries in sorted order doesn't help since to check the middle you'd need to traverse half the list.

Longest Common Subsequence between very large strings

I am trying to solve the Longest Common Subsequence problem, which is the problem of finding the longest subsequence common to all sequences in a set of sequences (often just two sequences).
I am trying to do this to calculate the overlap between 2 strings.
This is a well-known dynamic programming problem. However, in my case the strings are too large. When I tried to use a 2D matrix for memoization, I ran out of memory.
One solution could be to use a sparse matrix instead, but I am a little concerned about the performance overhead of that.
I also want to run this algorithm across multiple strings. An approximate answer would be okay, since I am only trying to measure the overlap between 2 strings.
EDIT: After some investigation I found the following alternatives
Hirschberg Algorithm https://en.wikipedia.org/wiki/Hirschberg%27s_algorithm
Original paper http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.348.4360&rep=rep1&type=pdf
Approximate algorithm : http://cs.haifa.ac.il/~ilan/online-papers/cpm09.pdf
Deposition and Extension Approach to Find Longest Common Subsequence for Multiple Sequences https://arxiv.org/pdf/0903.2015.pdf
LCS on DNA sequence http://www.sersc.org/journals/IJAST/vol47/2.pdf
Efficient Algorithm http://www.sciencedirect.com/science/article/pii/S0885064X12000635
To reduce memory usage, you don't need to store the entire 2D table. You only need to store the current row and the row above it (tracking the maximum separately if you need it), which reduces memory consumption to O(N). The time complexity remains O(N^2).
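A minimal sketch of the two-row idea in Python. The function name is an assumption for illustration, and it returns only the LCS length, not the subsequence itself. Recovering the subsequence in linear space needs something like Hirschberg's algorithm.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence, keeping only two DP rows."""
    if len(b) > len(a):
        a, b = b, a  # iterate over the longer string, keep rows short
    prev = [0] * (len(b) + 1)  # row for the previous character of a
    for x in a:
        curr = [0]  # row being filled in for the current character of a
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[len(b)]
```

Memory is proportional to the shorter string, while the running time is still proportional to the product of the two lengths.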

How to find unknown repeated patterns in the set of strings?

Here is a description of the problem. Suppose you have a set of strings (up to 10 billion strings, each up to 10k characters long, built from 1000 unique symbols). How can I find patterns with lengths from 2 up to N (let's say 10 for simplicity)? I'd also like to see only those patterns that occur in at least 1% of all strings (some threshold).
I'd like to find an algorithm that can help me solve this problem. The numbers are not exact, but they are the same order of magnitude as what we have in our project.
Thank you
Index all your strings in a suffix tree. This can be done in O(number of characters), and you only need to do it once before you start.
A suffix tree allows you to quickly (O(pattern length)) tell whether a pattern appears in any of the strings you've indexed, and how many times.
You can do another pass through the structure and count the number of leaves in each subtree (O(N) again). That tells you how often the substring spelled out from the root to that node occurs, so you can drop nodes or do whatever you want based on how common they are.
Now, 10 billion strings of length 10k, with 2-byte characters (to fit the 1000 unique symbols), is quite large (about 200 TB at the upper bound: 10^10 strings x 10^4 characters x 2 bytes) and won't fit in RAM. So you'll either need to wait for a while or get more computers and set up a distributed solution. You can apply the solution above to batches of strings so that each batch fits into your available memory, but then each lookup has to be repeated once per batch.
If everything is in batches, the most efficient way would be to make the batches as big as you can; then, once you've built the suffix tree for a batch, run all your queries through it, save the results, and drop the tree to free memory for the next batch of input strings.

Number of distinct palindromic substrings

Given a string, I know how to find the number of palindromic substrings in linear time using Manacher's algorithm. But now I need to find the number of distinct/unique palindromic substrings. Now, this might lead to an O(n + n^2) algorithm - one 'n' for finding all such substrings, and n^2 for comparing each of these substrings with the ones already found, to check if it is unique.
I am sure there is an algorithm with better complexity. I was thinking of maybe trying my luck with suffix trees? Is there an algorithm with better time complexity?
I would just put the substrings you find into a hash table, so that the same result is never stored twice.
Access time for a hash table is O(1).
As of 2015, there is a linear time algorithm for computing the number of distinct palindromic substrings of a given string S. You can use a data structure known as an eertree (or palindromic tree), as described in the linked paper. The idea is fairly complicated, but the premise is to build a trie of palindromes, and augment it with longest proper palindromic suffixes in a similar manner to the failure function of the Aho-Corasick Algorithm. See the original paper for more details: https://arxiv.org/pdf/1506.04862.pdf
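For orientation, here is a compact Python sketch of the eertree construction. The variable names and the dictionary-based edges are assumptions made for illustration, not the paper's notation. Every node other than the two roots corresponds to exactly one distinct palindromic substring, so the answer is the node count minus two.

```python
def count_distinct_palindromes(s):
    """Count distinct non-empty palindromic substrings of s with an eertree."""
    # Node 0: imaginary root of length -1. Node 1: empty palindrome, length 0.
    length = [-1, 0]
    link = [0, 0]      # suffix links (longest proper palindromic suffix)
    edges = [{}, {}]   # edges[v][c] = node for the palindrome c + v + c
    last = 1           # longest palindromic suffix of the prefix read so far

    def find(node, i):
        # Follow suffix links until s[i] can wrap the palindrome on both sides.
        while True:
            j = i - length[node] - 1
            if j >= 0 and s[j] == s[i]:
                return node
            node = link[node]  # node 0 always matches, so this terminates

    for i, ch in enumerate(s):
        parent = find(last, i)
        if ch not in edges[parent]:
            new = len(length)
            length.append(length[parent] + 2)
            edges.append({})
            if length[new] == 1:
                link.append(1)  # single characters link to the empty palindrome
            else:
                link.append(edges[find(link[parent], i)][ch])
            edges[parent][ch] = new
        last = edges[parent][ch]
    return len(length) - 2
```

For example, "abaab" contains the palindromes a, b, aa, aba and baab.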

String sorting using Merge Sort

What will be the worst complexity for sorting n strings having n characters each? Will it be just n times its avg. case O(n log n) or something else...?
When you are talking about O notation with two things with different lengths, typically you want to use different variables, like M and N.
So, if your merge sort is O(N log N), where N is the number of strings... and comparing two strings is O(M) where M scales with the length of the string, then you'll be left with:
O(N log N) * O(M)
or
O(M N log N)
where M is the string length and N is the number of strings. You want to use different labels because they don't mean the same thing.
In the strange case where the average string length scales with the number of strings, like if you had a matrix stored in strings or something like that, you could argue that M = N, and then you'd have O(N^2 log N)
As @orangeoctopus said, using a standard comparison-based sorting algorithm on a collection of n strings of size n will result in O(n^2 * log n) computation.
However - note that you can do it in O(n^2), with variations on radix sort.
The simplest way to do it [in my opinion] is:
1. Build a trie and populate it with all your strings. Inserting each string is O(n) and you do it n times, for a total of O(n^2).
2. Do a DFS on the trie, visiting children in character order. Each time you encounter an end-of-string mark, add that string to the sorted collection. Strings are added in lexicographic order, so your list will be sorted lexicographically when you are done.
It is easy to see that you cannot do any better than O(n^2), since just reading the data is O(n^2); thus this solution is optimal in terms of big-O time complexity.
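The trie-based sort might be sketched in Python as below. The nested-dictionary trie, the `END` sentinel, and the function name are assumptions for illustration. Children are sorted at each node, which is constant work per node only under the assumption of a small, fixed alphabet. The DFS is iterative so that long strings do not hit Python's recursion limit.

```python
def trie_sort(strings):
    """Return the strings in lexicographic order using a trie and a DFS."""
    END = object()  # hypothetical sentinel key: strings that end at this node
    root = {}
    for s in strings:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
        node.setdefault(END, []).append(s)  # keep duplicates together

    result = []
    stack = [root]
    while stack:
        node = stack.pop()
        # A string ending here is a prefix of everything below, so emit first.
        result.extend(node.get(END, ()))
        # Push children in reverse so the smallest character is popped first.
        for ch in sorted((k for k in node if k is not END), reverse=True):
            stack.append(node[ch])
    return result
```

Storing references to the original strings at the end markers avoids rebuilding each string character by character during the traversal.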
Sorting N items with merge sort requires O(N log N) comparisons. If the time to compare two items is O(1), then the total running time will be O(N log N). However, comparing two strings of length N requires O(N) time, so a naive implementation might be stuck with O(N*N log N) time.
This seems wasteful because we are not taking advantage of the fact that there are only N strings around to make comparisons. We might somehow preprocess the strings so that comparisons take less time on average.
Here is an idea. Create a trie and put the N strings in it. The trie will have O(N*N) nodes and require O(N*N) time to build. Traverse the trie and assign an integer "ranking" to each node; if R(Node1) < R(Node2), then the string associated with Node1 comes before the string associated with Node2 in a dictionary.
Now proceed with merge sort, doing the comparisons in O(1) time by looking up the rankings in the trie. The total running time will be O(N*N + N log N) = O(N*N).
Edit: My answer is very similar to that of @amit. However, I proceed with merge sort, where he proceeds with radix sort after the trie-building step.
