How to sort strings in dictionary order?

I have a file of strings and I have to sort them in dictionary order in O(n log(n)) time or less. I know the standard sorting algorithms and have applied them to sort numbers, but I have no idea how to sort strings using quicksort or any other sorting algorithm.
Please provide the algorithms, not built-in methods.

For strings, a common suggestion is radix sort.
It depends strictly on the alphabet used to form the strings, and its time complexity is O(k·n), where n is the number of keys and k is the average key length. Note that it can be misleading to compare this with O(n log n) (where n is the number of input elements).
So the smaller k is, the better the radix sort approach performs. I'll just quote an extended explanation (no need to rephrase it):
The topic of the efficiency of radix sort compared to other sorting algorithms is somewhat tricky and subject to quite a lot of misunderstandings. Whether radix sort is equally efficient, less efficient or more efficient than the best comparison-based algorithms depends on the details of the assumptions made. Radix sort efficiency is O(d·n) for n keys which have d or fewer digits. Sometimes d is presented as a constant, which would make radix sort better (for sufficiently large n) than the best comparison-based sorting algorithms, which all need O(n·log(n)) comparisons. However, in general d cannot be considered a constant. In particular, under the common (but sometimes implicit) assumption that all keys are distinct, d must be at least of the order of log(n), which gives at best (with densely packed keys) a time complexity of O(n·log(n)). That would seem to make radix sort at most equally efficient as the best comparison-based sorts (and worse if keys are much longer than log(n)).
Also, this algorithm uses additional space (worst-case space complexity O(k + n)), so you should be aware of that, unlike in-place comparison-based algorithms such as quicksort or heapsort, which use no additional space.
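To make this concrete, here is a minimal sketch of LSD (least-significant-digit-first) radix sort for strings, in Python. This is my own illustration, assuming ASCII input; the convention that a position past the end of a short string sorts before every real character is an implementation choice, not part of the answer above:

    def radix_sort_strings(strings):
        """LSD radix sort: one stable counting-sort pass per character
        position, from last to first. O(k * (n + R)) for n strings of
        maximum length k over an alphabet of size R."""
        if not strings:
            return []
        R = 256                                  # ASCII alphabet size (assumption)
        k = max(len(s) for s in strings)

        def char_at(s, i):
            # A position past the end of a short string sorts first (rank 0).
            return ord(s[i]) + 1 if i < len(s) else 0

        for i in range(k - 1, -1, -1):
            counts = [0] * (R + 2)
            for s in strings:                    # count occurrences at position i
                counts[char_at(s, i) + 1] += 1
            for r in range(R + 1):               # prefix sums -> start indices
                counts[r + 1] += counts[r]
            out = [None] * len(strings)
            for s in strings:                    # stable placement
                out[counts[char_at(s, i)]] = s
                counts[char_at(s, i)] += 1
            strings = out
        return strings

    print(radix_sort_strings(["banana", "apple", "ban", "app"]))
    # ['app', 'apple', 'ban', 'banana']

The stability of each counting-sort pass is what makes the per-position passes compose into full lexicographic order.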

You compare the numerical (ASCII) values of the letters and use those to order the strings numerically. You look at each word from left to right: if the first letters match, you compare the second letters, and so on.
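As a small illustration (my own Python sketch, not part of the answer above), such a character-by-character comparison can be plugged into any comparison-based sort, e.g. via functools.cmp_to_key:

    from functools import cmp_to_key

    def compare_strings(a, b):
        """Compare by character (ASCII) values, left to right;
        if one string is a prefix of the other, the shorter comes first."""
        for ca, cb in zip(a, b):
            if ord(ca) != ord(cb):
                return -1 if ord(ca) < ord(cb) else 1
        return (len(a) > len(b)) - (len(a) < len(b))

    words = ["banana", "ban", "apple"]
    words.sort(key=cmp_to_key(compare_strings))
    print(words)  # ['apple', 'ban', 'banana']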

Related

Average Case Big O and the Impact of Sorting

I'm looking at the time complexity for implementations of a method which determines if a String contains all unique characters.
The basic, brute-force approach would be to iterate through the String one character at a time, maintaining a HashSet of seen characters. For each character in the iteration we check if the Set already contains it, and if so return false. We return true if the entire String has been searched. This would be O(n) as a worst-case complexity. What would be the average case? O(n/2)?
If we try to optimise this by sorting the String into a char array, would it be more or less efficient? Sorting typically takes O(n log n), which is worse than O(n), but a sorted String allows duplicate characters to be detected much earlier (especially for long strings).
Do we say the worst case is O(n^2 log n) but the average case is better? If so, what is it?
In the un-sorted case, the average case depends entirely on the string! Without knowing or assuming a distribution, it's hard to say anything in general.
A simple case: a string of randomly placed characters in which exactly one character repeats once:
the number of possible arrangements of the repeated character's two positions is n*(n-1)/2
the probability it is detected as repeated in exactly k steps is 2*(k-1)/(n*(n-1))
the probability it is detected in at most k steps is (k*(k-1))/(n*(n-1)), meaning that (for large n) you will have detected it with 50% probability after about 0.7071*n steps.
For multiple characters that occur with different frequencies, or you make different assumptions on how characters are distributed in the string, you'll get different probabilities.
Hopefully someone can extend on my answer! :)
If the string is sorted, then you don't need the HashSet.
However, the average case still depends on the distribution of characters in the string: if you get two a's near the beginning, it's pretty efficient; if the duplicates are two z's, then you gained nothing.
The worst case is sorting plus detecting duplicates, so O(n log n + n), or just O(n log n).
So, it appears it's not advantageous to sort the string beforehand, due to the increased complexity, both in average-case and worst-case.
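For concreteness, here is a minimal Python sketch of both approaches discussed above (my own code, not from the question):

    def all_unique_hashset(s):
        """O(n) worst case: single scan, remembering seen characters."""
        seen = set()
        for ch in s:
            if ch in seen:
                return False       # duplicate found, stop early
            seen.add(ch)
        return True

    def all_unique_sorted(s):
        """O(n log n): sort first, then any duplicates are adjacent."""
        chars = sorted(s)
        return all(chars[i] != chars[i + 1] for i in range(len(chars) - 1))

    print(all_unique_hashset("abcdef"))  # True
    print(all_unique_sorted("abcdea"))   # False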

Complexities of the following - insertion sort, selection sort, merge sort, radix sort - and which is the best sorting algorithm, and why?

I don't believe in a 'best' sorting algorithm. It depends on what you want to do. For instance, bubble sort is really easy to implement and would be best if you just want a quick and dirty way of sorting a short array. On the other hand, for larger arrays, the time complexity really comes into play and you will notice considerable runtime differences. If you really value memory, then you probably want to evaluate the space complexities of these as well.
So the short answer is: IMHO, there's no best sorting algorithm. I'll leave the following table for you to evaluate for yourself what you want to use.
Sorting Algorithm    Avg Time Complexity    Space Complexity
Quicksort            O(n log n)             O(log n)
Mergesort            O(n log n)             O(n)
Insertion sort       O(n^2)                 O(1)
Selection sort       O(n^2)                 O(1)
Radix sort           O(nk)                  O(n + k)

How could you sort string words at a low level?

Of course there are handy library functions in all kinds of languages for sorting strings. However, I am interested in the low-level details of string sorting. My naive idea is to use the ASCII values of the strings to convert the problem into numerical sorting. However, if the words are longer than a single character, things get a little complicated for me. What is the state-of-the-art approach to sorting multi-character strings?
Strings are typically just sorted with a comparison-based sorting algorithm, such as quicksort or merge sort (I know of a few libraries that do this, and I'd assume most do, although there can certainly be exceptions).
But you could indeed convert your strings to numeric values and use a distribution sort, such as counting, bucket or radix sort, instead.
There's no silver bullet here, though; the best solution will largely depend on the scenario. It's really just something you have to benchmark with the sorting implementations you're using, on the system you're working on, with your typical data.
Naive sorting of ASCII strings is naive because it basically treats the strings as numbers written in base-128 or base-256. Dictionaries are meant for human usage and sort strings according to more complex criteria.
A pretty elaborate example is the 'Unicode Technical Standard #10' - Unicode Collation Algorithm.
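As an illustration of the gap between naive code-point ordering and dictionary-style collation, here is a small Python sketch. The exact collated order depends on the collation rules your platform provides, and the en_US.UTF-8 locale is an assumption that may not be installed everywhere:

    import locale

    words = ["Zebra", "apple", "Éclair", "banana"]

    # Naive ordering by code point: uppercase and accented letters land
    # in surprising places ("Zebra" before "apple", "Éclair" last).
    print(sorted(words))

    # Collation-aware ordering, closer to what a dictionary would use.
    locale.setlocale(locale.LC_COLLATE, "en_US.UTF-8")  # assumed to exist
    print(sorted(words, key=locale.strxfrm))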

Looking for ideas: lexicographically sorted suffix array of many different strings compute efficiently an LCP array

I don't want a direct solution to the problem that's the source of this question, but it's this one link:
So I take in the strings and add them to a suffix array, which is implemented as a sorted set internally; what I obtain then is a lexicographically sorted list of the suffixes of the two given strings.
S1 = "banana"
S2 = "panama"
SuffixArray.add S1, S2
To make searching for the k-th smallest substring efficient, I preprocess this sorted set, adding to each suffix the length of the longest common prefix with its predecessor, as well as keeping a cumulative substring count. That way I know that any query with k greater than the cumulative substring count of the last item is invalid.
This works really well for small inputs, as well as for random large inputs within the constraints given in the problem definition, which is at most 50 strings of length 2000. I am able to pass 4 out of the 7 cases, and was pretty surprised I didn't get them all.
So I went searching for the bottleneck, and it hit me. Given large inputs like these:
anananananananana.....ananana
bkbkbkbkbkbkbkbkb.....bkbkbkb
The queries for the k-th smallest substring are still fast, as expected, but the way I preprocess the sorted set is not. The way I calculate the longest common prefix between the elements of the set is a naïve linear O(m) scan; I did the most obvious thing, expecting it to be good enough:
m = anananan
n = anananana
Start at 0 and find the point where `m[i] != n[i]`
It is like this because a suffix and its predecessor might not be related (i.e. they may come from different input strings), so I thought I had no choice but to use brute force.
Here, then, is the question, and what I ended up reducing the problem to. Given a lexicographically sorted list of suffixes, built in the manner I described above (made up of multiple strings):
What is an efficient way of computing the longest common prefix array?
The subquestion would then be: am I completely off the mark in my approach? Please propose further avenues of investigation if that's the case.
As a footnote, I do not want to be shown an implemented algorithm, and I don't mind being told to go read such-and-such a book or resource on the subject, as that is what I do anyway while attempting these challenges.
The accepted answer will be something that guides me onto the right path or, failing that, something that teaches me how to solve these types of problems in a broader sense: a book or something.
READING
I would recommend this tutorial PDF from Stanford.
It explains a simple O(n log^2 n) algorithm with O(n log n) space to compute the suffix array together with a matrix of intermediate results. That matrix of intermediate results can then be used to compute the longest common prefix between two suffixes in O(log n).
HINTS
If you wish to try to develop the algorithm yourself, the key is to sort the suffixes based on their prefixes of length 2^k.
From the tutorial:
Let A(i,k) denote the subsequence of A of length 2^k starting at position i.
The position of A(i,k) in the sorted array of A(j,k) subsequences (j=1,n) is kept in P(k,i).
and
Using the matrix P, one can iterate descending from the biggest k down to 0 and check whether A(i,k) = A(j,k). If the two prefixes are equal, a common prefix of length 2^k has been found. We then only have to update i and j, increasing them both by 2^k, and check again whether there are more common prefixes.
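Here is a minimal Python sketch of that idea (my own code and naming, not the tutorial's): build the rank matrix P by prefix doubling, then answer LCP queries by matching blocks of size 2^k from the largest k down:

    def build_rank_matrix(s):
        """P[k][i] is the rank of the block s[i : i + 2**k] among all
        blocks of that length (prefix doubling, O(n log^2 n) overall)."""
        n = len(s)
        P = [[ord(c) for c in s]]        # level 0: single characters
        half = 1
        while half < n:
            prev = P[-1]
            def key(i):
                return (prev[i], prev[i + half] if i + half < n else -1)
            order = sorted(range(n), key=key)
            rank = [0] * n
            for t in range(1, n):
                rank[order[t]] = rank[order[t - 1]] + (key(order[t]) != key(order[t - 1]))
            P.append(rank)
            half *= 2
        return P

    def lcp(P, n, i, j):
        """Longest common prefix of the suffixes at i and j, in O(log n)."""
        if i == j:
            return n - i
        length = 0
        for k in range(len(P) - 1, -1, -1):
            if i < n and j < n and P[k][i] == P[k][j]:
                length += 1 << k     # blocks of length 2^k match: extend
                i += 1 << k
                j += 1 << k
        return length

    s = "banana"
    P = build_rank_matrix(s)
    print(lcp(P, len(s), 1, 3))  # suffixes "anana" and "ana" share "ana" -> 3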

Best way to sort a long list of strings

I would like to know the best way to sort a long list of strings with respect to time and space efficiency; I prefer time efficiency over space efficiency.
The strings can be numeric, alphabetic, alphanumeric, etc. I am not interested in sort behaviour like alphanumeric vs. alphabetic ordering, just the sort itself.
Some of the ways I can think of are below.
Using code, e.g. the .Net framework's Arrays.Sort() function. I think the way this works is that hash codes for the strings are calculated and each string is inserted at the proper position using a binary search.
Using a database (e.g. MS SQL). I have not done this, and I do not know how efficient it would be.
Using a prefix-tree data structure such as a trie. Sorting requires traversing all the trie nodes using DFS (depth-first search), which takes O(|V| + |E|) time. (Searching takes O(l) time, where l is the length of the string to compare.) A sketch of this option appears after the list.
Any other ways or data structures?
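A minimal Python sketch of the trie option above (my own code; TrieNode and trie_sort are names I made up for illustration):

    class TrieNode:
        def __init__(self):
            self.children = {}   # char -> TrieNode
            self.count = 0       # number of strings ending at this node

    def trie_sort(strings):
        """Insert every string into a trie, then emit them in sorted
        order with a DFS that visits children in character order."""
        root = TrieNode()
        for s in strings:
            node = root
            for ch in s:
                node = node.children.setdefault(ch, TrieNode())
            node.count += 1

        out = []
        def dfs(node, prefix):
            out.extend([prefix] * node.count)     # duplicates preserved
            for ch in sorted(node.children):      # character order = sorted order
                dfs(node.children[ch], prefix + ch)
        dfs(root, "")
        return out

    print(trie_sort(["banana", "apple", "ban", "app"]))
    # ['app', 'apple', 'ban', 'banana']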
You say that you have a database, and presumably the strings are stored in the database. Then you should get the database to do the work for you. It may be able to take advantage of an index and therefore not need to actually sort the list, but just read it from the index in sorted order.
If there is no index, the database might still be able to help you, if you only fetch the first k rows for some small constant k, for example 100. When you use ORDER BY with a LIMIT clause, SQL Server can use a special optimization called TOP N SORT, which runs in linear time instead of O(n log n) time.
If your strings are not in the database already then you should use the features provided by .NET instead. I think it is unlikely you will be able to write custom code that will be much faster than the default sort.
I found this paper that uses a trie data structure to efficiently sort large sets of strings. I have not looked into it in detail, though.
Radix sort could also be a good option if the strings are not very long, e.g. a list of names.
Let us suppose you have a large list of strings, and that the length of the list is n.
Using a comparison-based sorting algorithm like MergeSort, HeapSort or Quicksort will give you O(d * n * log n) running time, where n is the size of the list and d is the maximum length of the strings in the list, since each comparison can cost up to O(d).
We can try to use radix sort in this case. Let b be the base (the alphabet size) and let d be the length of the longest string; then we can show that the running time using radix sort is O(d * (n + b)).
Furthermore, if the strings are, say, over the lowercase English alphabet, the running time is O(d * (n + 26)) = O(d * n).
Source: MIT OpenCourseWare algorithms lecture by Prof. Erik Demaine.
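As a rough illustration (my own arithmetic, not from the lecture): for n = 10^6 strings of maximum length d = 10 over the b = 26 lowercase letters, a comparison sort performs on the order of d * n * log2(n) ≈ 10 * 10^6 * 20 = 2 * 10^8 character operations, while radix sort performs about d * (n + b) ≈ 10^7.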
