String sorting using Merge Sort

What will be the worst-case complexity for sorting n strings having n characters each? Will it be just n times its average case of O(n log n), or something else...?

When you are talking about O notation for two quantities with different meanings, you typically want to use different variables, like M and N.
So, if your merge sort is O(N log N), where N is the number of strings... and comparing two strings is O(M) where M scales with the length of the string, then you'll be left with:
O(N log N) * O(M)
or
O(M N log N)
where M is the string length and N is the number of strings. You want to use different labels because they don't mean the same thing.
In the strange case where the average string length scales with the number of strings, like if you had a matrix stored in strings or something like that, you could argue that M = N, and then you'd have O(N^2 log N)
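To make the two factors concrete, here is a minimal sketch (my own Python, not code from the answer): a merge sort over a list of strings with a character-by-character comparator (char_compare, a hypothetical helper) that counts character comparisons, showing where the extra O(M) per comparison comes from.

    def char_compare(a, b, counter):
        # Compare strings character by character, counting each character
        # comparison; worst case O(M) where M is the string length.
        for ca, cb in zip(a, b):
            counter[0] += 1
            if ca != cb:
                return -1 if ca < cb else 1
        return (len(a) > len(b)) - (len(a) < len(b))

    def merge_sort(strings, counter):
        # Standard top-down merge sort: O(N log N) string comparisons,
        # each costing up to O(M) character comparisons.
        if len(strings) <= 1:
            return strings
        mid = len(strings) // 2
        left = merge_sort(strings[:mid], counter)
        right = merge_sort(strings[mid:], counter)
        merged, i, j = [], 0, 0
        while i < len(left) and j < len(right):
            if char_compare(left[i], right[j], counter) <= 0:
                merged.append(left[i])
                i += 1
            else:
                merged.append(right[j])
                j += 1
        merged.extend(left[i:])
        merged.extend(right[j:])
        return merged

    count = [0]
    print(merge_sort(["banana", "apple", "cherry", "apricot"], count))
    print("character comparisons:", count[0])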

As @orangeoctopus notes, using a standard comparison-based sorting algorithm on a collection of n strings of length n will result in O(n^2 * log n) computation.
However, note that you can do it in O(n^2) with a variation on radix sort.
The simplest way to do it [in my opinion] is:
Build a trie and populate it with all your strings. Inserting each string is O(n), and you do it n times, for a total of O(n^2).
Do a DFS on the trie; each time you encounter the end-of-string mark, add that string to the sorted collection. The strings are added in lexicographic order, so your list will be sorted lexicographically when you are done.
It is easy to see you cannot do any better than O(n^2), since merely reading the data is already O(n^2); thus this solution is optimal in terms of big-O time complexity.
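A minimal sketch of those two steps (my own code, not the answer's), using a dict-of-dicts trie with a sentinel end marker. Sorting the children at each node costs only a constant factor for a fixed alphabet.

    def trie_sort(strings):
        END = "\0"                    # sentinel; assumes '\0' never occurs in the input
        root = {}
        for s in strings:             # inserting one string is O(n); n strings -> O(n^2)
            node = root
            for ch in s:
                node = node.setdefault(ch, {})
            node[END] = node.get(END, 0) + 1     # count duplicates
        out = []
        def dfs(node, prefix):
            if END in node:
                out.extend([prefix] * node[END])  # reached an end-of-string mark
            for ch in sorted(k for k in node if k != END):
                dfs(node[ch], prefix + ch)        # visit children in alphabet order
        dfs(root, "")
        return out

    print(trie_sort(["ba", "ab", "aa", "ab"]))    # ['aa', 'ab', 'ab', 'ba']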

Sorting n items with MergeSort requires O(N log N) comparisons. If the time to compare two items is O(1), then the total running time will be O(N log N). However, comparing two strings of length N requires O(N) time, so a naive implementation might be stuck with O(N*N log N) time.
This seems wasteful because we are not taking advantage of the fact that there are only N strings around to make comparisons. We might somehow preprocess the strings so that comparisons take less time on average.
Here is an idea. Create a trie structure and put the N strings there. The trie will have O(N*N) nodes and require O(N*N) time to build. Traverse the tree and assign an integer "rank" R to each node; if R(N1) < R(N2), then the string associated with node N1 comes before the string associated with node N2 in a dictionary.
Now proceed with Mergesort, doing the comparisons in O(1) time by looking up the precomputed ranks. The total running time will be O(N*N + N log N) = O(N*N).
Edit: My answer is very similar to that of @amit. However, I proceed with mergesort where he proceeds with radix sort after the trie-building step.
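A minimal sketch of the ranking idea (again my own code, not the answer's): build the trie, assign each stored string an integer rank in DFS order, and then sort using those ranks so that every comparison is a single O(1) integer comparison. Python's built-in sort, itself a mergesort variant, stands in for the mergesort step here.

    def rank_by_trie(strings):
        END = "\0"                    # sentinel; assumes '\0' never occurs in the input
        root = {}
        for s in strings:             # building the trie is O(N*N) overall
            node = root
            for ch in s:
                node = node.setdefault(ch, {})
            node[END] = {}
        ranks = {}
        def dfs(node, prefix):
            if END in node:
                ranks[prefix] = len(ranks)        # next rank in DFS (= dictionary) order
            for ch in sorted(k for k in node if k != END):
                dfs(node[ch], prefix + ch)
        dfs(root, "")
        return ranks

    words = ["cab", "ab", "abc", "b"]
    rank = rank_by_trie(words)
    print(sorted(words, key=rank.__getitem__))    # every comparison is one integer compare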

Related

Will I have a smaller time complexity if I use both the head and tail pointers of a doubly linked list to search for an element?

So if I want to search for an element in a doubly-linked list, will I get a better time complexity, such as O(log N), if I search from both sides (the beginning of the list and the end of the list) at the same time, or will I still get linear time?
You will still get linear time complexity if you're traversing links in a doubly-linked list. Binary search's logarithmic time complexity depends on index-based random access to array elements in a sorted list. Consider a doubly-linked list with n/2 instances of a constant c followed by n/2 instances of 2c. To determine that a number b with c < b < 2c is not in such a list, you'd have to check n/2 entries regardless of which end you search from. Even having the entries in sorted order doesn't help, since to check the middle you'd need to traverse half the list.
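A minimal sketch (my own code, not from the answer) of searching from both ends at once: it can halve the number of steps at best, but each step still follows one link, so the worst case remains about n/2 traversals per pointer, i.e. O(n).

    class Node:
        def __init__(self, value):
            self.value = value
            self.prev = None
            self.next = None

    def build_list(values):
        head = tail = None
        for v in values:
            node = Node(v)
            if head is None:
                head = tail = node
            else:
                tail.next, node.prev = node, tail
                tail = node
        return head, tail

    def search_both_ends(head, tail, target):
        front, back, steps = head, tail, 0
        while front is not None and back is not None:
            steps += 1
            if front.value == target or back.value == target:
                return True, steps
            if front is back or front.next is back:   # the two pointers have met
                return False, steps
            front, back = front.next, back.prev
        return False, steps

    head, tail = build_list(list(range(100)))
    print(search_both_ends(head, tail, 50))           # found, but only after ~n/2 steps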

TextRank Algorithm Space and Time Complexity

I am trying to determine the space and time complexity of TextRank, the algorithm described in this paper:
https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf
Since it uses PageRank, whose complexity is:
O(n + m) (n - number of nodes, m - number of arcs/edges)
and we run it for i iterations (or until convergence), I believe the complexity for keyword extraction would be O(i*(n+m)),
and the space complexity would be O(n^2) using an adjacency matrix.
For sentence extraction I believe it would be the same thing.
I'm really not sure, and any help would be great. Thank you.
If you repeat an (inner) algorithm with complexity O(n+m), or any other complexity for that matter, T times, it is correct to conclude that the new (outer) algorithm will have a complexity of O(T*(n+m)) provided:
The outer algorithm will only add a constant complexity every time it repeats the inner one.
Parameters n and m remain the same at every invocation of the inner algorithm.
In other words, the outer algorithm should prepare the inputs for the inner one in constant time, and the new inputs should remain well represented by the parameters n and m across the T iterations. Otherwise, if either of these two requirements cannot be established, you should sum the T complexities associated with the new parameters, say
O(n_1 + m_1) + ... + O(n_T + m_T)
and also take into account all the pre- and post-processing of the outer algorithm before and after using the inner.
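As an illustration of the reasoning above, here is a minimal sketch (assumed, not taken from the paper) of T power-iteration steps of plain PageRank over an adjacency-list graph; each iteration touches every node and every edge once, so the total work is O(T*(n+m)). The damping factor and uniform teleport term are the usual PageRank choices, and the graph below is a made-up example.

    def pagerank(adj, T=30, damping=0.85):
        # adj: dict mapping node -> list of out-neighbours.
        n = len(adj)
        rank = {v: 1.0 / n for v in adj}
        for _ in range(T):                                       # T iterations
            dangling = sum(rank[v] for v in adj if not adj[v])   # O(n)
            base = (1.0 - damping + damping * dangling) / n
            new_rank = {v: base for v in adj}                    # O(n)
            for v, neighbours in adj.items():                    # O(n + m) in total
                if neighbours:
                    share = damping * rank[v] / len(neighbours)
                    for u in neighbours:
                        new_rank[u] += share
            rank = new_rank
        return rank

    graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
    print(pagerank(graph, T=50))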

Naive Suffix Array time complexity

I'm trying to invent a programming exercise on suffix arrays. I learned the O(n*log(n)^2) algorithm for constructing them and then started playing with random input strings of varying lengths in order to find out when the naive approach becomes too slow. For example, I wanted to choose the string length so that people would need to implement the "advanced" algorithm.
Suddenly I found that the naive algorithm (sorting all suffixes with a standard O(n log n) comparison sort) is not as slow as O(n^2 * log(n)) suggests. After thinking a bit, I understood that comparing suffixes of a randomly generated string is not O(n) amortized. Really, we usually compare only the first few characters before hitting a difference and returning from the comparison function. This of course depends on the size of the alphabet, but in any case it does not depend much on the length of the suffixes.
I tried a simple implementation in PHP that processes a 50000-character string in 2 seconds (despite the slowness of a scripting language). If it really behaved like O(n^2), we would expect it to take at least several minutes (at ~1e7 operations per second and ~1e9 operations in total).
So I understand that even if it is O(n^2 * log(n)), the constant factor must be a very small fraction of 1, really something close to 0. Or should we speak of such a complexity as worst-case only?
But what is the amortized time complexity of the naive approach? I'm a bit bewildered about how to assess it.
You seem to be confusing amortized and expected complexity. In this case you are talking about expected complexity. And yes, the stated complexity is computed assuming that a suffix comparison takes O(n). That is the worst case for a suffix comparison; for randomly generated input you will perform only a constant number of character comparisons in most cases. Thus O(n^2*log(n)) is the worst-case complexity.
One more note: on a modern computer you can perform a few billion elementary instructions per second, so it is quite possible to execute on the order of 50000^2 operations in 2 seconds. The correct way to benchmark the complexity of an algorithm empirically is to measure the time it takes to complete for inputs of size N, 2N, 4N, ... (as far as you can go) and then fit a function that describes the computational complexity.
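Here is a minimal sketch (my own Python, not the asker's PHP) of the naive construction plus the doubling benchmark suggested above; the comparator stops at the first differing character, which is exactly why random input runs far faster than the O(n) worst-case comparison cost would suggest.

    import random
    import string
    import time
    from functools import cmp_to_key

    def naive_suffix_array(s):
        n = len(s)
        def compare(i, j):
            # Compare suffixes s[i:] and s[j:]; stops at the first difference.
            while i < n and j < n:
                if s[i] != s[j]:
                    return -1 if s[i] < s[j] else 1
                i += 1
                j += 1
            return (n - i) - (n - j)      # the exhausted (shorter) suffix comes first
        return sorted(range(n), key=cmp_to_key(compare))

    def benchmark(start=5000, doublings=4, alphabet=string.ascii_lowercase):
        n = start
        for _ in range(doublings):
            text = "".join(random.choice(alphabet) for _ in range(n))
            t0 = time.perf_counter()
            naive_suffix_array(text)
            print(f"n={n:>6}  {time.perf_counter() - t0:.3f}s")
            n *= 2

    benchmark()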

Longest common substring in two sequences of strings

Having just learned the longest common substring algorithm, I was curious about a particular variant of the problem. It is described as follows:
Given two non-empty sequences of strings, X = (x1, x2, x3,....,x(n)) and Y = (y1, y2, y3,..., y(m)), where x(i) and y(i) are strings of characters, find the longest string in X which is a substring of all the strings of Y.
I have a function substring(x, y) which returns a boolean indicating whether x is a substring of y. Evidently, I have to concatenate all the strings in Y to form one big string, say, denoted by B. I thought of the following approaches:
Naive: Start by concatenating all strings in X to form a string A(n). Apply substring(A(n), B) - this includes iterating backward in string A(n). If true, the algorithm ends here and returns A(n) - or whatever portion of it is included in said substring. If not, proceed to apply substring(A(n - 1), B) and so on. If such a string does not exist in X, I return the empty string.
Obviously this approach would take up quite some running time depending on the implementation. Assuming I use an iterative approach, on each iteration I would have to iterate backward through the string at that level/index, and subsequently apply substring(). It would take at least two loops, and O(size(B) * maxlength(x1, x2, ...)) worst-case time, or more depending on substring() (correct me if I'm wrong).
I thought of a second approach based on suffix trees/arrays.
Generalized Suffix Tree: I build a GST of the sequence Y using Ukkonen's algorithm in O(maxlength(y1, y2, ...))(?). My lack of knowledge of suffix trees bites. I believe a suffix tree approach would substantially reduce the running time (at the cost of space) for finding the substring, but I have no idea how to implement the operation.
If there is a better approach, I'd love to know.
EDIT: Apologies if it seemed like I abandoned the topic.
What if I were to use not a GST, but some standard data structure such as a stack, queue, set, heap, priority queue, etc.? The sequence X would have to be sorted, largest string first, naturally. If I store it in a string array, I will have to use a sorting algorithm such as mergesort/quicksort. The goal is to get the most efficient running time possible.
Can I not store X in a structure that automatically sorts its elements as it builds itself? What about a max-heap?
It would seem like the suffix tree is the best way to find substrings in this fashion. Is there any other data structure I could use?
First, order the array X from the longest string to the shortest. This way, the first string in X that is a substring of all the Y strings is the solution.
A multiprocessor algorithm would be the best way to quickly test each X string against all the Y strings.
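A minimal sketch of that answer's idea (the function name is mine): sort X from longest to shortest and return the first string that is a substring of every string in Y, using Python's in operator in place of substring(x, y).

    def longest_common_from_x(X, Y):
        for candidate in sorted(X, key=len, reverse=True):   # longest strings first
            if all(candidate in y for y in Y):               # substring(candidate, y) for every y
                return candidate
        return ""                                            # no string of X works

    X = ["abcd", "bc", "d", "xyz"]
    Y = ["zabcdz", "abce", "xbcy"]
    print(longest_common_from_x(X, Y))                       # prints "bc"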
Here is my idea for a solution to your problem; I am not sure about everything, so comments are welcome to improve it if you think it is worth the effort.
Begin by computing all common substrings of all strings in Y. First take two strings and build a tree of all their common substrings. Then, for each other string in Y, remove from the tree every substring that does not appear in this string. The complexity is linear in the number of strings in Y, but I can't figure out how many elements might be in the tree, so I cannot give an estimate of the final complexity.
Then find the longest string in X which is a substring of one in the tree.
There must be some improvements to do to keep the tree as small as possible, such as keeping only substrings that are not substrings of others.
Writing |Y| for the number of strings in the set Y, and len(Y) for their total length:
Process the strings in Y into a generalized suffix tree (for example, using Ukkonen's algorithm). Takes time O(len(Y)), assuming a constant-size alphabet.
Mark each node in the suffix tree according to whether the string identified by that node belongs to all the strings in Y. Takes time O(|Y| len(Y)).
For each string in X, look it up in the suffix tree and see if the node is marked as belonging to all the strings in Y. Output the longest such marked string. Takes time O(len(X)).
Total time: O(|Y| len(Y)) + O(len(X)).

An efficient algorithm for finding smallest pangrammatic windows?

A pangrammatic window is a substring of a larger piece of text that contains all 26 letters of the alphabet. To quote an example from Wikipedia, given this text:
I sang, and thought I sang very well; but he just looked up into my face with a very quizzical expression, and said, 'How long have you been singing, Mademoiselle?'
The smallest pangrammatic window in the text is this string:
g very well; but he just looked up into my face with a very quizzical ex
Which indeed contains every letter at least once.
My question is this: Given a text corpus, what is the most efficient algorithm for finding the smallest pangrammatic window in the text?
I've given this some thought and come up with the following algorithms already. I have a strong feeling that these are not optimal, but I thought I'd post them as a starting point.
There is a simple naive algorithm that runs in time O(n^2) and space O(1): For each position in the string, scan forward from that position and track what letters you've seen (perhaps in a bit vector, which, since there are only 26 different letters, takes space O(1)). Once you've found all 26 letters, you have the length of the shortest pangrammatic window starting at that given point. Each scan might take time O(n), and there are O(n) scans, for a grand total of O(n^2) time.
We can also solve this problem in time O(n log n) and space O(n) using a modified binary search. Construct 26 arrays, one for each letter of the alphabet, then populate those arrays with the positions of each letter in the input text in sorted order. We can do this by simply scanning across the text, appending each index to the array corresponding to the current character. Once we have this, we can find, in time O(log n), the length of the shortest pangrammatic window beginning at some index by running 26 binary searches in the arrays to find the earliest time that each character appears in the input array at or after the given index. Whichever of these numbers is greatest gives the "long pole" character that appears furthest down in the string, and thus gives the endpoint of the pangrammatic window. Running this search step takes O(log n) time, and since we have to do it for all n characters in the string, the total runtime is O(n log n), with O(n) memory usage for the arrays.
A further refinement for the above approach is to replace the arrays and binary search with van Emde Boas trees and predecessor searches. This increases the creation time to O(n log log n), but reduces each search time to O(log log n) time, for a net runtime of O(n log log n) with O(n) space usage.
Are there any better algorithms out there?
For every letter, keep track of its most recent sighting. Whenever you process a letter, update the corresponding sighting index and, once every letter has been sighted at least once, calculate the range (max - min) of the sighting indexes over all letters. Find the location with the minimum range.
Complexity is O(n), or O(n*log(m)) if you take the alphabet size m into account.
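A minimal sketch of this idea (my own implementation, treating the per-position scan over the 26 letters as a constant factor): keep the most recent index of each letter; once every letter has been seen, the window from the oldest recent sighting to the current position is pangrammatic, and the shortest such window over all positions is the answer.

    def smallest_pangram_window(text):
        last_seen = {}                            # letter -> index of its most recent sighting
        best = None
        lowered = text.lower()
        for i, ch in enumerate(lowered):
            if not ch.isalpha():
                continue
            last_seen[ch] = i
            if len(last_seen) == 26:              # every letter sighted at least once
                start = min(last_seen.values())   # oldest of the recent sightings
                if best is None or i - start < best[1] - best[0]:
                    best = (start, i)
        return None if best is None else text[best[0]:best[1] + 1]

    sample = ("I sang, and thought I sang very well; but he just looked up into my "
              "face with a very quizzical expression, and said, 'How long have you "
              "been singing, Mademoiselle?'")
    print(smallest_pangram_window(sample))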
This algorithm has O(M) space complexity and O(N) time complexity (time does not depend on alphabet size M):
1. Advance the first iterator and increase a counter for each processed letter. Stop when all 26 counters are non-zero.
2. Advance the second iterator and decrease the counter for each processed letter. Stop when any of these counters drops to zero.
3. Use the difference between the iterators to update the best-so-far result, then continue with step 1.
This algorithm may be improved a little bit if, instead of character counters, positions in the string are stored. In this case step 2 should only read these positions and compare them with the current position, while step 1 should update these positions and (most of the time) search for some character in the text.
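A minimal sketch of the two-iterator scheme above (my own code, not the answer's): the right end expands until all 26 letters are covered, then the left end shrinks while the window stays pangrammatic, recording the best window seen. It can be tested on the same sample sentence as in the previous sketch.

    from collections import defaultdict

    def smallest_window_two_pointers(text):
        lowered = text.lower()
        counts = defaultdict(int)    # letter -> occurrences inside the current window
        distinct = 0                 # how many of the 26 letters the window covers
        best = None
        left = 0
        for right, ch in enumerate(lowered):
            if ch.isalpha():
                counts[ch] += 1
                if counts[ch] == 1:
                    distinct += 1
            while distinct == 26:    # window is pangrammatic: record it, then shrink
                if best is None or right - left < best[1] - best[0]:
                    best = (left, right)
                c = lowered[left]
                if c.isalpha():
                    counts[c] -= 1
                    if counts[c] == 0:
                        distinct -= 1
                left += 1
        return None if best is None else text[best[0]:best[1] + 1]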
