Fast, semi-accurate sort in Linux

I'm going through a huge list of files in Linux, the output of a "find" (directory walk). I want to sort the list by filename, but I'd like to begin processing the files as soon as possible.
I don't need the sort to be 100% correct.
How can I do a "partial sort", that might be off some of the time but will output quickly?

This is StackOverflow, not SuperUser, so an algorithm answer should be enough for you.
Try implementing HeapSort. But instead of sorting the full list of names, do the following.
1. Pick a constant M. The smaller it is, the more "off" the output will be and the sooner the algorithm will start printing results. In the limiting case where M equals the total number of names, it becomes an exact sort.
2. Load the first M names and heapify() them.
3. Take the lowest element from the heap and print it. Put the next unsorted name into its place, then do siftDown().
4. Repeat step 3 until you run out of unsorted names, then do a standard HeapSort on the elements left in the heap.
This algorithm is linear in the number of names and starts printing as soon as the first M of them have been read. Step 2 is O(M) == O(1). Step 3 is O(log M) == O(1) and is repeated O(N) times, hence the total is O(N).
This algorithm will try to keep the large elements in the heap as long as possible while pushing the lowest elements from the heap as quickly as possible. Hence the output will look as if it was almost sorted.
IIRC, a variant of this algorithm is actually what GNU sort does before switching to on-disk MergeSort: it keeps sorted runs of data as long as possible to minimize the number of on-disk merges.
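For concreteness, here is a minimal Go sketch of the bounded-heap idea (my own illustration, not tested against your setup; the window size M is an assumption you would tune). It reads names from stdin, so it can be fed straight from find:

// A minimal sketch of the bounded-heap idea above, reading names from stdin
// so it can be fed straight from find. The window size M is an assumption:
// larger means more accurate ordering, smaller means earlier output.
package main

import (
    "bufio"
    "container/heap"
    "fmt"
    "os"
)

// stringHeap is a min-heap of strings implementing heap.Interface.
type stringHeap []string

func (h stringHeap) Len() int            { return len(h) }
func (h stringHeap) Less(i, j int) bool  { return h[i] < h[j] }
func (h stringHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *stringHeap) Push(x interface{}) { *h = append(*h, x.(string)) }
func (h *stringHeap) Pop() interface{} {
    old := *h
    x := old[len(old)-1]
    *h = old[:len(old)-1]
    return x
}

func main() {
    const M = 1024 // window size
    h := &stringHeap{}
    in := bufio.NewScanner(os.Stdin) // e.g. find . | ./partialsort
    for in.Scan() {
        heap.Push(h, in.Text())
        if h.Len() > M {
            // emit the smallest name in the current window
            fmt.Println(heap.Pop(h).(string))
        }
    }
    // drain the remaining window: this tail is exactly heap-sorted
    for h.Len() > 0 {
        fmt.Println(heap.Pop(h).(string))
    }
}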

Related

Linux sort vs programming

I'm trying to understand why my software (written in Go) is 350 times slower than the Linux sort command. I was sorting a UTF-8 text file of around 13,000,000 lines (each 4-20 bytes long).
Code sample from my function (if checkDupl returns false, the line is appended to newArray):
func checkDupl(in []byte) bool {
    for i := range newArray {
        if bytes.Equal(in, newArray[i]) {
            return true
        }
    }
    return false
}
This code was only about 25% done after running overnight.
This code finished in 8 minutes:
export LC_ALL=C
time sort -us -o file_unique.txt file.txt
sort -u works by sorting the input, then iterating through and printing out each unique element. It can do that just by remembering which was the last thing it printed, and printing a new item whenever it changes.
Your code appears to be a linear search of the output array, so I assume it's part of a wider algorithm something like this:
for each X in input:
    if not checkDupl(X) then:
        append X to newArray
That means your checkDupl function runs once for every item in the input, and then the loop inside checkDupl runs once for every item in the output. In the worst case, the whole list is unique, so checkDupl looks at one item the first time, then two, then three, then four, .... That sequence adds up to n(n + 1) / 2, or 0.5n^2 + 0.5n. 13,000,000 squared dominates the 6.5 million of the other term, so we call that algorithm "quadratic time", or O(n^2). That's the worst case, and also an average case (but your best case, 13,000,000 identical lines, will be fairly quick).
There are many conventional sorting algorithms that work in O(n log n) time. POSIX does not require sort to use one of those, but all sensible implementations will do so. The log(n) term grows very slowly, so this will be much less than n^2. The printing is linear time, O(n), and can be ignored for the same reason as above.
Your program will take much longer to run than sort in all but the most trivial cases, for all but the most stupid of sorts. For your thirteen million items the difference could be hundreds of thousands of times (ignoring everything else about the programs).
You could implement a sorting algorithm and replicate sort's approach, or use a library function. You could also use a data structure better suited to checking uniqueness, such as a hash table, rather than an array that requires a linear search. Most likely, it will be better to use library functions than to try to roll everything yourself.
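If you go the hash-table route, a minimal Go sketch might look like this (note that it prints unique lines in input order, unlike sort -u, which prints them sorted):

// A sketch of the hash-table suggestion (not what sort -u does internally):
// it prints each unique line once, in input order rather than sorted order,
// in roughly O(n) expected time.
package main

import (
    "bufio"
    "fmt"
    "os"
)

func main() {
    seen := make(map[string]struct{}) // empty struct values: a set
    in := bufio.NewScanner(os.Stdin)
    for in.Scan() {
        line := in.Text()
        if _, dup := seen[line]; !dup {
            seen[line] = struct{}{}
            fmt.Println(line)
        }
    }
}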

finding element in very big list in less than O(n)

I want to check whether an element exists in a list (a very big one, on the order of 10,000,000 elements) in O(1) instead of O(n). Checking membership with elem x ys on a list takes O(n).
So I want to use another data type/constructor, but it has to be in the Prelude (not Array); any suggestions? And if I have to build my own data type, what would it look like?
I also want to sort a big list of numbers of the same size (10,000,000) and index an element in the shortest time possible.
The only way to search for an item in a data set in O(1) time is if you already know where it is, but then you don't need to search for it. For unsorted data, search is O(n) time. For sorted data, search is O(log n) time.
You should use either a Bloom filter or a hash table. Neither of them is in the Prelude; moreover, both rely on Array being available.
The only option left is some kind of tree; I would suggest a heap. It's not hard to implement and it also gives you sorting for free.
UPDATE: oops! I had forgotten that a heap doesn't provide lookup. A BST is your choice, then.
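The question asks for Haskell Prelude-only structures, but the BST idea itself is language-neutral; here is a rough sketch in Go of an unbalanced BST, just to show how lookup and "sorting for free" (the in-order walk) fall out of the same structure:

// A rough, language-neutral sketch of the BST suggestion: insert, lookup and
// an in-order traversal that yields the elements in sorted order.
package main

import "fmt"

type node struct {
    val         int
    left, right *node
}

// insert returns the tree with v added; duplicates are ignored.
func insert(t *node, v int) *node {
    if t == nil {
        return &node{val: v}
    }
    switch {
    case v < t.val:
        t.left = insert(t.left, v)
    case v > t.val:
        t.right = insert(t.right, v)
    }
    return t
}

// member runs in O(height): O(log n) on average, O(n) if the tree degenerates.
func member(t *node, v int) bool {
    for t != nil {
        switch {
        case v < t.val:
            t = t.left
        case v > t.val:
            t = t.right
        default:
            return true
        }
    }
    return false
}

// inorder appends the values in ascending order: sorting "for free".
func inorder(t *node, out []int) []int {
    if t == nil {
        return out
    }
    out = inorder(t.left, out)
    out = append(out, t.val)
    return inorder(t.right, out)
}

func main() {
    var root *node
    for _, v := range []int{5, 3, 8, 1, 4} {
        root = insert(root, v)
    }
    fmt.Println(member(root, 4), member(root, 7)) // true false
    fmt.Println(inorder(root, nil))               // [1 3 4 5 8]
}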

Best way to sort a long list of strings

I would like to know the best way to sort a long list of strings with respect to time and space efficiency. I prefer time efficiency over space efficiency.
The strings can be numeric, alphabetic, alphanumeric, etc. I am not interested in the sort behavior, like alphanumeric sort vs. alphabetic sort, just the sort itself.
Some ways I can think of:
Using code, e.g. the .NET framework's Array.Sort() function. I think the way this works is that hash codes for the strings are calculated and each string is inserted at the proper position using a binary search.
Using a database (e.g. MS SQL). I have not done this, and I do not know how efficient it would be.
Using a prefix tree data structure like a trie. Sorting requires traversing all the trie nodes using DFS (depth-first search) - O(|V| + |E|) time. (Searching takes O(l) time, where l is the length of the string to compare.)
Any other ways or data structures?
You say that you have a database, and presumably the strings are stored in the database. Then you should get the database to do the work for you. It may be able to take advantage of an index and therefore not need to actually sort the list, but just read it from the index in sorted order.
If there is no index, the database might still be able to help, provided you only need the first k rows for some small constant k, for example 100. When you use ORDER BY with a row-limiting clause (TOP in SQL Server, LIMIT elsewhere), SQL Server can use a special optimization called TOP N SORT, which runs in linear time instead of O(n log n) time.
If your strings are not in the database already then you should use the features provided by .NET instead. I think it is unlikely you will be able to write custom code that will be much faster than the default sort.
I found this paper that uses a trie data structure to efficiently sort large sets of strings. I have not looked into it in detail though.
Radix sort could also be a good option if the strings are not very long, e.g. a list of names.
Let us suppose you have a large list of strings and that the length of the list is n.
Using a comparison-based sorting algorithm like MergeSort, HeapSort or QuickSort will give you a running time of O(d * n log n),
where n is the size of the list and d is the maximum length of the strings in the list (each comparison can cost up to O(d)).
We can try to use radix sort in this case. Let b be the base (the size of the alphabet) and let d be the length of the longest string; then we can show that the running time using radix sort is O(d * (n + b)).
Furthermore, if the strings are, say, lower-case English letters, then b = 26 is a constant and the running time is O(d * n).
Source: MIT OpenCourseWare Algorithms lecture by Prof. Erik Demaine.
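As a rough illustration of that bound (my own sketch, not from the lecture), here is an LSD radix sort over byte strings in Go. The radix here is 257: the 256 byte values plus one bucket for "past the end of the string", which makes shorter strings sort first; the running time matches O(d * (n + b)):

// LSD radix sort on byte strings: one stable counting sort per character
// position, rightmost position first.
package main

import "fmt"

func radixSortStrings(a []string) {
    d := 0 // length of the longest string
    for _, s := range a {
        if len(s) > d {
            d = len(s)
        }
    }
    const R = 257 // 256 byte values + 1 bucket for "no character here"
    bucketAt := func(s string, pos int) int {
        if pos < len(s) {
            return int(s[pos]) + 1
        }
        return 0 // shorter strings sort before their extensions
    }
    aux := make([]string, len(a))
    for pos := d - 1; pos >= 0; pos-- {
        count := make([]int, R+1)
        for _, s := range a {
            count[bucketAt(s, pos)+1]++
        }
        for r := 0; r < R; r++ {
            count[r+1] += count[r] // starting index of each bucket
        }
        for _, s := range a {
            aux[count[bucketAt(s, pos)]] = s
            count[bucketAt(s, pos)]++
        }
        copy(a, aux)
    }
}

func main() {
    names := []string{"carol", "bob", "alice", "al", "bo"}
    radixSortStrings(names)
    fmt.Println(names) // [al alice bo bob carol]
}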

searching and sorting

If the list has 1024 items (lg 1024 = 10), at what point (i.e. after how many searches) does sorting the list first and using binary search pay off, compared with using sequential search? How does the answer change if the list has 2048 items?
Where the "linear access" curve crosses the "binary search" curve depends on how long it takes to access/insert a single item versus how many items there are. This will be different for every combination of compiler, memory and cpu architecture, type of data/node in the list, the distribution of data values, what sort and insertion algorithms you use, etc... But with a "large enough" set of items, the running time can be described by mentioning how its upper bound grows with increasing number of items, even though that "Big-O" bound may not precisely describe any particular run.
You can figure it out precisely only if you know the specific algorithms you will insert and search with, determine the actual instructions that make up your list accesses, find out how many clock cycles they take to execute, and so on.
Then you can say for sure which one is faster, and at which point. And if you know your data values, you can model it. But if you don't, you have to make assumptions (for example, what if the inserted data values are already ordered? How does that affect your sort or insertion function?).
For example, a single item retrieval may take 1us. Comparing two items may take 0.5us. Doing a sorted list insertion with 100 items in the list might require X number of retrievals, Y number of compares, and Z number of updates/writes.... Whereas an unordered list might require more or less depending on what's already there and what you're inserting.
If your list is unsorted, each lookup takes O(n). Sorting with quicksort costs O(n log n), after which each binary search is O(log n). Let x be the number of searches; the break-even point is roughly where x * n = x * log n + n * log n. By plugging in different values you can estimate the dynamics. My rough estimate is that for n = 1024, sorting first becomes more efficient once the number of searches exceeds roughly 10. Put 1024 in place of n and try it.
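Using that rough model (and ignoring all constant factors), a tiny Go program can evaluate the break-even point for both list sizes:

// Solve x*n = x*log2(n) + n*log2(n) for x. The answer comes out close to
// log2(n): about 10 searches for 1024 items and about 11 for 2048 items.
package main

import (
    "fmt"
    "math"
)

func breakEven(n float64) float64 {
    lg := math.Log2(n)
    return n * lg / (n - lg) // x = n*log n / (n - log n)
}

func main() {
    for _, n := range []float64{1024, 2048} {
        fmt.Printf("n = %.0f: sorting pays off after about %.1f searches\n",
            n, breakEven(n))
    }
}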

How to find high frequency words in a book in an environment low on memory?

Recently in a technical interview, I was asked to write a program to find the high-frequency words (words which appear the maximum number of times) in a text book. The program should be designed in such a way that it processes the entire text book with minimum memory. Performance is not a concern. I was able to write a program to find the frequency of words, but it consumed a lot of memory.
How do you make this operation less memory intensive? Any strategies/solutions?
-Snehal
You probably used hash tables, which are memory-intensive but have constant lookup time, so the performance/memory trade-off is obvious. By the time you reach the end of the book you will know your answer. Also, incrementing counters for each word is fast (because of the quick hashtable lookups).
The other end of the spectrum is to look at the first word, then go through the entire book to see how many times that word occurs. This requires minimal memory. Then you do the same for the next word and go through the entire book. If that word occurs more times, you add that as the top word (or top N words). Of course, this is extremely inefficient--if the first and third word are the same you'll end up going through the whole book again even though you just did the same thing for the first word.
OK, if you're only interested in the highest n occurring words, one way to do it is in two passes, with the first pass based on a modified Bloom Filter. Instead of using a bit map to track hash occurrences, use an integer array instead - either byte, 16 bit, 32 bit or even 64 bit depending on your input size. Where a Bloom filter simply sets the bit corresponding to each of the hash values of a word, you'll increment the count at the hash index in the array.
The problem with this approach is that two words will probably give the same hash values. So you need to do a second pass where you ignore words unless their hash totals are above a certain threshold, thus reducing the amount of memory you need to allocate to do accurate counting.
So just create a bit map with bits set for the highest occurring hash values. Then in the second pass of the words, if a word has "hits" in the bitmap for its hashes, look it up or add it to a hash table and increment its count. This minimises memory usage by creating a hash table of only the highest occurring words.
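Here is a rough Go sketch of that two-pass idea (it uses a single hash function rather than a true Bloom filter's several; the bucket count and the threshold are made-up numbers you would tune to the book and your memory budget):

// Pass 1: approximate per-bucket counts in a fixed-size array.
// Pass 2: exact counts, but only for words whose bucket looks frequent.
package main

import (
    "bufio"
    "fmt"
    "hash/fnv"
    "log"
    "os"
    "strings"
)

const buckets = 1 << 20 // ~1M counters: memory is fixed regardless of book size

func bucketOf(word string) uint32 {
    h := fnv.New32a()
    h.Write([]byte(word))
    return h.Sum32() % buckets
}

// eachWord streams the file word by word so the whole book is never in memory.
func eachWord(path string, fn func(string)) {
    f, err := os.Open(path)
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()
    sc := bufio.NewScanner(f)
    sc.Split(bufio.ScanWords)
    for sc.Scan() {
        fn(strings.ToLower(sc.Text()))
    }
    if err := sc.Err(); err != nil {
        log.Fatal(err)
    }
}

func main() {
    path := os.Args[1]

    approx := make([]uint32, buckets)
    eachWord(path, func(w string) { approx[bucketOf(w)]++ }) // may over-count on collisions

    const threshold = 1000 // tune to the book
    exact := make(map[string]int)
    eachWord(path, func(w string) {
        if approx[bucketOf(w)] >= threshold {
            exact[w]++
        }
    })

    for w, c := range exact {
        if c >= threshold {
            fmt.Println(w, c)
        }
    }
}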
I'm a physicist, so my favourite approach is to approximate. You don't need to go through the entire text to get the most frequent words. Instead:
parse a chunk small enough to allow for your memory limitations,
skip a random amount of text,
repeat, combining accumulated results.
Stop when the list has satisfactorily converged.
If you use a memory-efficient algorithm for the smaller chunks (e.g. sorting) then you can get far faster performance than even the most efficient algorithm that reads every word.
Note: this assumes that the most frequent words occur frequently throughout the text, not just in one place. For English text this assumption holds, because of the frequency of words like 'the' etc. throughout. If you're worried about this requirement, require the algorithm to complete at least one pass of the entire text.
I'll probably get down-voted for this...
If the text is English and you just want to find the top 5 most frequent words, here is your program:
print "1. the\n";
print "2. of\n";
print "3. and\n";
print "4. a\n";
print "5. to\n";
Runs fast and consumes minimal memory!
If performance is really of no concern you could just go through each word in turn, check if it's in your "top N" and, if it isn't, count all its occurrences. This way you're only storing N values. Of course, you'd be counting the same words many times, but, as you said, performance isn't an issue - and the code would be trivial (which is generally preferable, all other things being equal).
One way would be to sort the list first.
We can sort the words in place without a lot of memory (traded for slow performance).
Then we can run a simple counting loop that finds the words with maximum frequency without having to keep everything in memory, since they're in sorted order.
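A small Go sketch of that sort-then-scan idea (it assumes the words are already in a slice; in the low-memory setting you would substitute an in-place or external sort for sort.Strings):

// Once the words are sorted, equal words are adjacent, so one pass with a
// run counter finds the most frequent word.
package main

import (
    "fmt"
    "sort"
)

func mostFrequent(words []string) (string, int) {
    sort.Strings(words) // equal words become adjacent
    best, bestCount, run := "", 0, 0
    for i, w := range words {
        run++
        // end of a run of identical words?
        if i == len(words)-1 || words[i+1] != w {
            if run > bestCount {
                best, bestCount = w, run
            }
            run = 0
        }
    }
    return best, bestCount
}

func main() {
    words := []string{"the", "cat", "sat", "on", "the", "mat", "the"}
    fmt.Println(mostFrequent(words)) // the 3
}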
Do you mean a lot of process memory? If so, one way would be to use the disk as virtual memory (aka write a filesystem wrapper).
A possible solution is to use a trie data structure for storing all words associated to their number of occurrences.
Other solutions may be found in answers to this related question: Space-Efficient Data Structure for Storing a Word List?
Like many good interview questions, the question is phrased a little ambiguously/imprecisely, to force the interviewee to ask clarifying questions and state assumptions. I think a number of the other answers here are good, as they poke at these assumptions and demonstrate big-picture understanding.
I'm assuming the text is stored 'offline' somewhere, but there is a way to iterate over each word in the text without loading the whole text into memory.
Then the F# code below finds the top N words. Its only data structure is a mapping of key-value pairs (word, frequency), and it only keeps the top N of those, so the memory use is O(N), which is small. The runtime is O(numWordsInText^2), which is poor, but acceptable given the problem constraints. The gist of the algorithm is straightforward: for each word in the text, count how many times it occurs, and if it's in the running best N, add it to the list and remove the previous minimum entry.
Note that the actual program below loads the entire text into memory, merely for convenience of exposition.
#light
// some boilerplate to grab a big piece of text off the web for testing
open System.IO
open System.Net
let HttpGet (url: string) =
    let req = System.Net.WebRequest.Create(url)
    let resp = req.GetResponse()
    let stream = resp.GetResponseStream()
    let reader = new StreamReader(stream)
    let data = reader.ReadToEnd()
    resp.Close()
    data
let text = HttpGet "http://www-static.cc.gatech.edu/classes/cs2360_98_summer/hw1"
let words = text.Split([|' ';'\r';'\n'|], System.StringSplitOptions.RemoveEmptyEntries)
// perhaps 'words' isn't actually stored in memory, but so long as we can
// 'foreach' over all the words in the text we're good
let N = 5 // how many 'top frequency' words we want to find
let FindMin map =
    // key-value pair with minimum value in a map
    let (Some(seed)) = Map.first (fun k v -> Some(k,v)) map
    map |> Map.fold_left
        (fun (mk,mv) k v -> if v > mv then (mk,mv) else (k,v))
        seed
let Main() =
    let mutable freqCounts = Map.of_list [ ("",0) ]
    for word in words do
        let mutable count = 0
        for x in words do
            if x = word then
                count <- count + 1
        let minStr,minCount = FindMin freqCounts
        if count >= minCount then
            freqCounts <- Map.add word count freqCounts
            if Seq.length freqCounts > N then
                freqCounts <- Map.remove minStr freqCounts
    freqCounts
    |> Seq.sort_by (fun (KeyValue(k,v)) -> -v)
    |> Seq.iter (printfn "%A")
Main()
Output:
[the, 75]
[to, 41]
[in, 34]
[a, 32]
[of, 29]
You could use a combination of external merge sort and a priority queue. Merge sort will make sure that your memory limits are honored, and the priority queue will maintain your top K words. Obviously, the priority queue has to be small enough to fit into memory.
First, divide the input strings into chunks, sort each chunk and store it in secondary storage (external sorting) - O(n log n).
Read each chunk and, within the chunk, calculate the frequency of words, so at the end of this step each chunk is reduced to (unique word, frequency count) pairs within the chunk. O(n)
Start reading elements across the chunks and aggregate the counts for each word. Since the chunks are sorted, you can do it in O(n).
Now maintain a min priority heap (the top of the heap is the minimum element in the heap) of K elements. Populate the heap with the first K elements; then for each next (unique word, final count), if its count is greater than the top element in the heap, pop the top and push the current word. O(n log k) (a sketch of this last stage follows below)
So your final time complexity is O(n(log k + log n))
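A Go sketch of that final heap stage only (it assumes the earlier external-sort steps already deliver one aggregated (word, count) pair per unique word):

// Keep the K pairs with the highest counts using a size-K min-heap: the
// weakest candidate sits on top and is evicted first.
package main

import (
    "container/heap"
    "fmt"
)

type wordCount struct {
    word  string
    count int
}

type minHeap []wordCount

func (h minHeap) Len() int            { return len(h) }
func (h minHeap) Less(i, j int) bool  { return h[i].count < h[j].count }
func (h minHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *minHeap) Push(x interface{}) { *h = append(*h, x.(wordCount)) }
func (h *minHeap) Pop() interface{} {
    old := *h
    x := old[len(old)-1]
    *h = old[:len(old)-1]
    return x
}

// topK is O(log k) per incoming word.
func topK(stream []wordCount, k int) []wordCount {
    h := &minHeap{}
    for _, wc := range stream {
        if h.Len() < k {
            heap.Push(h, wc)
        } else if wc.count > (*h)[0].count {
            heap.Pop(h)
            heap.Push(h, wc)
        }
    }
    return *h // heap order, not sorted
}

func main() {
    agg := []wordCount{{"the", 75}, {"to", 41}, {"in", 34}, {"a", 32}, {"of", 29}, {"cat", 3}}
    fmt.Println(topK(agg, 3)) // the three highest-count words, in heap order
}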
Well, if you want absolutely terrible performance...
Take the first word in the book, and count how many times it occurs. Take the second word in the book, count how many times it occurs. If it's more than the last word, discard the last word. And so forth... you'll end up counting the same words multiple times unless you keep a list of them somewhere, but if you really want to minimize memory, this should only require a few ints. Should run in O(n^2) time, where n is the number of words in the book.
How about creating a binary tree of word keys (as you keep reading the words from the file)? This lets you look up already-seen words in O(log n), so in the end you get O(n log n) for the top-word search.
The basic algorithm would be:
for each word in the file:
    Create a unique key for the given word (e.g. a weighted ASCII key: "bat" could be 1*'b' + 2*'a' + 3*'t').
    Add this word to the tree. If the word already exists, increment its count.
    Feed the word and the current count to maintainTop5(word, count). maintainTop5() maintains a dynamic list of the top 5 counts and their associated words.
At the end of the file you have the top 5 words.
