Linux sort vs programming - linux

I'm trying to understand why my software (written in Go) is 350 times slower than the Linux sort command. I was sorting a UTF-8 text file of around 13,000,000 lines (4-20 bytes long).
Code sample from my function (if checkDupl returns false, the line is appended to newArray):
func checkDupl(in []byte) bool {
    for i := range newArray {
        if bytes.Equal(in, newArray[i]) {
            return true
        }
    }
    return false
}
This code was only about 25% done after running overnight.
This code finished in 8min:
export LC_ALL=C
time sort -us -o file_unique.txt file.txt

sort -u works by sorting the input, then iterating through and printing out each unique element. It can do that just by remembering which was the last thing it printed, and printing a new item whenever it changes.
Your code appears to be a linear search of the output array, so I assume it's part of a wider algorithm something like this:
for each X in input:
    if not checkDupl(X) then:
        append X to newArray
That means your checkDupl function runs once for every item in the input, and then the loop inside checkDupl runs once for every item in the output. In the worst case, the whole list is unique, so checkDupl looks at one item the first time, then two, then three, then four, .... That sequence adds up to n(n + 1)/2, or 0.5n^2 + 0.5n. With n = 13,000,000, the 0.5n^2 term (about 8.5*10^13) dwarfs the 6.5 million of the other term, so we call that algorithm "quadratic time", or O(n^2). That's the worst case, and also the average case (though your best case, 13,000,000 identical lines, would be fairly quick).
There are many conventional sorting algorithms that work in O(n log n) time. POSIX does not require sort to use one of those, but all sensible implementations will do so. The log(n) term grows very slowly, so this will be much less than n^2. The printing is linear time, O(n), and can be ignored for the same reason as above.
Your program will take much longer to run than sort in all but the most trivial cases, for all but the most stupid of sorts. For your thirteen million items the difference could be hundreds of thousands of times (ignoring everything else about the programs).
You could implement a sorting algorithm and replicate sort's approach, or use a library function. You could also use a data structure more suited to checking uniqueness, like a hash table, rather than an array that requires a linear search. Most likely, it'll be better to use library functions than to try to roll everything yourself.
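As a rough illustration of the hash-table suggestion (this is a sketch, not the asker's actual program; reading lines from stdin and the map pre-sizing are assumptions), a Go version that deduplicates in a single linear pass might look like this:
package main

import (
    "bufio"
    "fmt"
    "os"
)

func main() {
    // Keep each distinct line once, in first-seen order.
    seen := make(map[string]struct{}, 1<<20) // rough pre-sizing for large inputs
    var unique []string

    scanner := bufio.NewScanner(os.Stdin)
    for scanner.Scan() {
        line := scanner.Text()
        if _, ok := seen[line]; ok {
            continue // duplicate: a map lookup is O(1) on average, not a scan of the output
        }
        seen[line] = struct{}{}
        unique = append(unique, line)
    }
    if err := scanner.Err(); err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }

    out := bufio.NewWriter(os.Stdout)
    defer out.Flush()
    for _, line := range unique {
        fmt.Fprintln(out, line)
    }
}
Unlike sort -u, this keeps the first-seen order rather than sorting, which is usually fine when all you want is the unique lines.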

Related

How long does it take to crack a hash?

I want to calculate the time it will take to break a SHA-256 hash. So I researched and found the following calculation. If I have a password in lowercase letters with a length of 6 chars, I would have 26^6 passwords, right?
To calculate the time I have to divide this number by a hashrate, I guess. So if I had one RTX 3090, the hashrate would be 120 MH/s (1.2*10^8 H/s), and then I need to calculate 26^6/(1.2*10^8) to get the time in seconds, right?
Is this idea right or wrong?
Yes, but a lowercase-latin 6 character string is also short enough that you would expect to compute this one time and put it into a database so that you could look it up in O(1). It's only a bit over 300M entries. That said, given you're 50% likely to find the answer in the first half of your search, it's so fast to crack that you might not even bother unless you were doing this often. You don't even need a particularly fancy GPU for something on this scale.
Note that in many cases the password could also be shorter than 6 characters, so you need to add 26^6 + 26^5 + 26^4 + ..., but all of these together only raise the total to around 320M hashes. It's a tiny space.
Adding uppercase, numbers and the easily typed symbols gets you up to 96^6 ~ 780B. On the other hand, adding just 3 more lowercase-letters (9 total) gets you to 26^9 ~ 5.4T. For brute force on random strings, longer is much more powerful than complicated.
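Plugging those keyspace sizes into the 120 MH/s figure from the question gives a feel for the scale. This is a small Go sketch of that arithmetic, not a benchmark; the rate is simply the one quoted above, and the times are full-exhaustion worst cases (on average you would expect to succeed in about half that):
package main

import (
    "fmt"
    "math"
)

// worstCaseSeconds returns how long exhausting charset^length candidates
// takes at the given hash rate; on average you expect success in half that.
func worstCaseSeconds(charset, length, hashesPerSec float64) float64 {
    return math.Pow(charset, length) / hashesPerSec
}

func main() {
    const rate = 1.2e8 // 120 MH/s, the figure quoted in the question

    fmt.Printf("26^6 lowercase, len 6: %.0f s\n", worstCaseSeconds(26, 6, rate)) // a few seconds
    fmt.Printf("96^6 printable, len 6: %.0f s\n", worstCaseSeconds(96, 6, rate)) // a couple of hours
    fmt.Printf("26^9 lowercase, len 9: %.0f s\n", worstCaseSeconds(26, 9, rate)) // about half a day
}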
To your specific question, note that it does matter how you implement this. You won't get these kinds of hash rates if you don't write your code in a way to maximize the GPU. For example, writing simple code that sends one value to the GPU to hash at a time, and then compares the result on the CPU could be incredibly slow (in some cases slower than just doing all the work on a CPU). Setting up your memory efficiently and maximizing things the GPU can do in parallel are very important. If you're not familiar with this kind of programming, I recommend using or studying a tool like John the Ripper.

Finding the most similar string among a set of millions of strings

Let's say I have a dictionary (word list) of millions upon millions of words. Given a query word, I want to find the word from that huge list that is most similar.
So let's say my query is elepant, then the result would most likely be elephant.
If my word is fentist, the result will probably be dentist.
Of course assuming both elephant and dentist are present in my initial word list.
What kind of index, data structure or algorithm can I use for this so that the query is fast? Hopefully complexity of O(log N).
What I have: The most naive thing to do is to create a "distance function" (which computes the "distance" between two words, in terms of how different they are) and then in O(n) compare the query with every word in the list, and return the one with the closest distance. But I wouldn't use this because it's slow.
The problem you're describing is a Nearest Neighbor Search (NNS). There are two main methods of solving NNS problems: exact and approximate.
If you need an exact solution, I would recommend a metric tree, such as the M-tree, the MVP-tree, and the BK-tree. These trees take advantage of the triangle inequality to speed up search.
If you're willing to accept an approximate solution, there are much faster algorithms. The current state of the art for approximate methods is Hierarchical Navigable Small World (hnsw). The Non-Metric Space Library (nmslib) provides an efficient implementation of hnsw as well as several other approximate NNS methods.
(You can compute the Levenshtein distance with Hirschberg's algorithm)
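To make the metric-tree option concrete, here is a hedged Go sketch of a BK-tree over Levenshtein distance. The tiny word list, the distance cutoff of 2, and the plain dynamic-programming Levenshtein (rather than Hirschberg's algorithm) are all illustrative choices:
package main

import "fmt"

// levenshtein computes the edit distance between two strings with two rolling rows.
func levenshtein(a, b string) int {
    ra, rb := []rune(a), []rune(b)
    prev := make([]int, len(rb)+1)
    curr := make([]int, len(rb)+1)
    for j := range prev {
        prev[j] = j
    }
    for i := 1; i <= len(ra); i++ {
        curr[0] = i
        for j := 1; j <= len(rb); j++ {
            cost := 1
            if ra[i-1] == rb[j-1] {
                cost = 0
            }
            curr[j] = min(prev[j]+1, min(curr[j-1]+1, prev[j-1]+cost))
        }
        prev, curr = curr, prev
    }
    return prev[len(rb)]
}

func min(a, b int) int {
    if a < b {
        return a
    }
    return b
}

// bkNode is a BK-tree node; children are keyed by edit distance to this node's word.
type bkNode struct {
    word     string
    children map[int]*bkNode
}

func (n *bkNode) insert(word string) {
    d := levenshtein(word, n.word)
    if d == 0 {
        return // already present
    }
    if child, ok := n.children[d]; ok {
        child.insert(word)
    } else {
        n.children[d] = &bkNode{word: word, children: map[int]*bkNode{}}
    }
}

// search collects all words within maxDist of the query, pruning subtrees
// whose distance band cannot possibly contain a match.
func (n *bkNode) search(query string, maxDist int, out *[]string) {
    d := levenshtein(query, n.word)
    if d <= maxDist {
        *out = append(*out, n.word)
    }
    for dist, child := range n.children {
        if dist >= d-maxDist && dist <= d+maxDist {
            child.search(query, maxDist, out)
        }
    }
}

func main() {
    root := &bkNode{word: "elephant", children: map[int]*bkNode{}}
    for _, w := range []string{"dentist", "relevant", "element", "elegant"} {
        root.insert(w)
    }
    var hits []string
    root.search("elepant", 2, &hits)
    fmt.Println(hits) // candidates within edit distance 2 of "elepant"
}
The pruning in search is exactly the triangle-inequality trick mentioned above: a child at distance dist from the node can only hold matches if dist lies within maxDist of the query's distance to the node.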
I made a similar algorithm some time ago.
The idea is to have an array char[255] indexed by character,
where each value is a list of word hashes (word IDs) for the words that contain that character.
When you are searching for 'dele....':
search(d) will return an empty list,
search(e) will find everything containing the character e, including elephant (twice, as it has two 'e's),
search(l) will bring you a new list, which you combine with the results from the previous step,
...
At the end of the input you will have a list;
then you can group it by word hash and order it by count, descending.
Also, an interesting property: if your input is missing one or more characters, you will just get an empty list in the middle of the search, and it does not break the idea.
My initial algorithm had no ordering, and for every character I stored the word ID, line number, and character position.
My main problem was that I wanted to search
with ee to find 'elephant',
with eleant to find 'elephant',
with antph to find 'elephant'.
Every word was actually a line from a file, so it was often very long,
and the number of files and lines was big.
I wanted a quick search over directories with more than 1 GB of text files,
so even storing them in memory was a problem. For this idea you need 3 parts:
a function to fill your cache,
a function to find by character from the input,
a function to filter and maybe order the results (I didn't use ordering, as I was trying to fill my cache in the same order as I read the file, and I wanted to output lines containing the input in that same order).
I hope it makes sense.
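Here is a rough Go sketch of that character-index idea as I understand it. The toy word list and the scoring rule (one hit per matching character occurrence, candidates ranked by total hits) are assumptions made for illustration:
package main

import (
    "fmt"
    "sort"
)

// buildIndex maps each character to the IDs of the words containing it.
// A word with a repeated character gets one entry per occurrence, so
// repeated letters in the query are rewarded as well.
func buildIndex(words []string) map[rune][]int {
    index := make(map[rune][]int)
    for id, w := range words {
        for _, c := range w {
            index[c] = append(index[c], id)
        }
    }
    return index
}

// rank scores every candidate word by how many query characters hit it,
// then returns word IDs ordered by descending score.
func rank(index map[rune][]int, query string) []int {
    hits := make(map[int]int)
    for _, c := range query {
        for _, id := range index[c] {
            hits[id]++
        }
    }
    ids := make([]int, 0, len(hits))
    for id := range hits {
        ids = append(ids, id)
    }
    sort.Slice(ids, func(i, j int) bool { return hits[ids[i]] > hits[ids[j]] })
    return ids
}

func main() {
    words := []string{"elephant", "dentist", "relevant"}
    index := buildIndex(words)
    for _, id := range rank(index, "eleant") {
        fmt.Println(words[id])
    }
}
Words that merely share many letters with the query will also score highly, so in practice you would probably re-rank the top few candidates with a real distance function.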

Collections: How will you find the top 10 longest strings in a list of a billion strings?

I was recently asked a question in an interview. How will you find the top 10 longest strings in a list of a billion strings?
My answer was that we need to write a Comparator that compares the lengths of 2 strings and then use the TreeSet(Comparator) constructor.
Once you start adding strings to the TreeSet, it will keep them sorted according to the comparator you defined.
Then just pop the top 10 elements of the TreeSet.
The interviewer wasn't happy with that. The argument was that, to hold a billion strings, I would have to use a supercomputer.
Is there any other data structure that can deal with this kind of data?
Given what you stated about the interviewer saying you would need a super computer, I am going to assume that the strings would come in a stream one string at a time.
Given the immense size due to no knowledge of how large the individual strings are (they could be whole books), I would read them in one at a time from the stream. I would then compare the current string to an ordered list of the top ten longest strings found before it and place it accordingly in the ordered list. I would then remove the smallest length one from the list and proceed to read the next string. That would mean only 11 strings were being stored at one time, the current top 10 and the one currently being processed.
Most languages have a built-in sort that is pretty speedy.
stringList.sort(key=len, reverse=True)
in Python would work. Then just grab the first 10 elements.
Also, your interviewer sounds behind the times. One billion strings is pretty small nowadays.
I remember studying a similar data structure for such scenarios, called a trie.
The height of the tree always gives the length of the longest string.
A special kind of trie, called a suffix tree, can be used to index all suffixes in a text in order to carry out fast full text searches.
The point is you do not need to STORE all strings.
Let's think about a simplified version: find the longest 2 strings (assuming no ties).
You can always use an online algorithm with 2 variables s1 & s2, where s1 is the longest string you have encountered so far and s2 is the second longest.
Then you read the strings one by one in O(N), replacing s1 or s2 whenever a longer string arrives. This uses O(2N) = O(N).
For the top 10 strings, it is as dumb as the top 2 case. You can still do it in O(10N) = O(N) and store only 10 strings.
There is a faster way, described below, but for a small constant like 2 or 10 you may not need it.
For top-K strings in general, you can use a structure like C++'s set (with longer strings having higher priority) to store the top-K strings; when a new string comes, you simply insert it and remove the last one, both in O(lg K). So in total you can do it in O(N lg K) with O(K) space.
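As a concrete version of the O(N lg K) approach, here is a hedged Go sketch that uses container/heap (instead of a C++ set) to keep only the K longest strings seen so far; the channel standing in for the input stream is an assumption:
package main

import (
    "container/heap"
    "fmt"
)

// shortestFirst is a min-heap ordered by string length, so the shortest
// of the current top-K sits at the root and is cheap to evict.
type shortestFirst []string

func (h shortestFirst) Len() int            { return len(h) }
func (h shortestFirst) Less(i, j int) bool  { return len(h[i]) < len(h[j]) }
func (h shortestFirst) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *shortestFirst) Push(x interface{}) { *h = append(*h, x.(string)) }
func (h *shortestFirst) Pop() interface{} {
    old := *h
    x := old[len(old)-1]
    *h = old[:len(old)-1]
    return x
}

// topKLongest streams the input once, keeping at most k strings in memory.
func topKLongest(input <-chan string, k int) []string {
    h := &shortestFirst{}
    heap.Init(h)
    for s := range input {
        if h.Len() < k {
            heap.Push(h, s)
        } else if len(s) > len((*h)[0]) {
            (*h)[0] = s   // replace the current shortest of the top K
            heap.Fix(h, 0)
        }
    }
    return *h
}

func main() {
    in := make(chan string)
    go func() {
        for _, s := range []string{"a", "abcd", "ab", "abcdefg", "abc", "abcde"} {
            in <- s
        }
        close(in)
    }()
    fmt.Println(topKLongest(in, 3)) // the 3 longest, in heap order
}
The heap only ever holds K strings plus the one currently being examined, which addresses the interviewer's memory objection.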

fast, semi-accurate sort in linux

I'm going through a huge list of files in Linux, the output of a "find" (directory walk). I want to sort the list by filename, but I'd like to begin processing the files as soon as possible.
I don't need the sort to be 100% correct.
How can I do a "partial sort", that might be off some of the time but will output quickly?
This is StackOverflow, not SuperUser, so an algorithm answer should be enough for you.
Try implementing HeapSort. But instead of sorting the full list of names, do the following.
1. Pick a constant M. The smaller it is, the more "off" the result will be and the "faster" the algorithm will start printing results. In the limiting case where M is equal to the number of all names, it will be an exact sorting algorithm.
2. Load the first M elements and heapify() them.
3. Take the lowest element from the heap and print it. Put the next unsorted name into its place, then do siftDown().
4. Repeat until you run out of unsorted names. Then do a standard HeapSort on the elements left in the heap.
This algorithm will be linear in the number of names and will start printing names as soon as the first M of them have been read. Step 2 is O(M) == O(1). Step 3 is O(log M) == O(1), and it is repeated O(N) times, hence the total is O(N).
This algorithm will try to keep the large elements in the heap as long as possible while pushing the lowest elements from the heap as quickly as possible. Hence the output will look as if it was almost sorted.
IIRC, a variant of this algorithm is actually what GNU sort does before switching to on-disk MergeSort, to keep sorted runs of data as long as possible and minimize the number of on-disk merges.
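Here is a minimal Go sketch of the bounded-heap idea described above, reading names from stdin (for example, piped from find); the fixed M of 1000 is an arbitrary illustrative choice:
package main

import (
    "bufio"
    "container/heap"
    "fmt"
    "os"
)

// minHeap of file names; the lexicographically smallest name is always at the root.
type minHeap []string

func (h minHeap) Len() int            { return len(h) }
func (h minHeap) Less(i, j int) bool  { return h[i] < h[j] }
func (h minHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *minHeap) Push(x interface{}) { *h = append(*h, x.(string)) }
func (h *minHeap) Pop() interface{} {
    old := *h
    x := old[len(old)-1]
    *h = old[:len(old)-1]
    return x
}

func main() {
    const M = 1000 // window size: smaller starts printing sooner but is more "off"

    h := &minHeap{}
    scanner := bufio.NewScanner(os.Stdin) // e.g. find ... | thisprogram
    for scanner.Scan() {
        heap.Push(h, scanner.Text())
        if h.Len() > M {
            // Emit the smallest of the M+1 buffered names and keep reading.
            fmt.Println(heap.Pop(h).(string))
        }
    }
    // Input exhausted: drain the rest in sorted order (a plain heap sort).
    for h.Len() > 0 {
        fmt.Println(heap.Pop(h).(string))
    }
}
Names that arrive more than M positions after something smaller can come out of order, which is the "semi-accurate" part of the trade-off.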

How to find high frequency words in a book in an environment low on memory?

Recently in a technical interview, I was asked to write a program to find the high-frequency words (words which appear the maximum number of times) in a text book. The program should be designed in such a way that it processes the entire text book with minimum memory. Performance is not a concern. I was able to write a program to find the frequency of words, but it consumed a lot of memory.
How do you make this operation less memory intensive? Any strategies/solutions?
-Snehal
You probably used hash tables, which are memory-intensive but have constant lookup time, so the performance/memory trade-off is obvious. By the time you reach the end of the book you will know your answer. Also, incrementing counters for each word is fast (because of the quick hashtable lookups).
The other end of the spectrum is to look at the first word, then go through the entire book to see how many times that word occurs. This requires minimal memory. Then you do the same for the next word and go through the entire book. If that word occurs more times, you keep that as the top word (or top N words). Of course, this is extremely inefficient: if the first and third word are the same, you'll end up going through the whole book again even though you just did the same thing for the first word.
OK, if you're only interested in the highest n occurring words, one way to do it is in two passes, with the first pass based on a modified Bloom filter. Instead of using a bit map to track hash occurrences, use an integer array: byte, 16 bit, 32 bit, or even 64 bit depending on your input size. Where a Bloom filter simply sets the bit corresponding to each of the hash values of a word, you'll increment the count at the hash index in the array.
The problem with this approach is that two words will probably give the same hash values. So you need to do a second pass where you ignore words unless their hash totals are above a certain threshold, thus reducing the amount of memory you need to allocate to do accurate counting.
So just create a bit map with bits set for the highest occurring hash values. Then in the second pass of the words, if a word has "hits" in the bitmap for its hashes, look it up or add it to a hash table and increment its count. This minimises memory usage by creating a hash table of only the highest occurring words.
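A rough Go sketch of this two-pass scheme, assuming a hypothetical book.txt and an arbitrary bucket count and threshold; the single FNV hash stands in for whatever hash functions you would actually choose:
package main

import (
    "bufio"
    "fmt"
    "hash/fnv"
    "os"
    "strings"
)

const buckets = 1 << 20 // size of the counting array; tune to the memory budget

func bucket(word string) uint32 {
    h := fnv.New32a()
    h.Write([]byte(word))
    return h.Sum32() % buckets
}

// scanWords streams the file word by word, calling visit on each lowercased word.
func scanWords(path string, visit func(string)) error {
    f, err := os.Open(path)
    if err != nil {
        return err
    }
    defer f.Close()
    sc := bufio.NewScanner(f)
    sc.Split(bufio.ScanWords)
    for sc.Scan() {
        visit(strings.ToLower(sc.Text()))
    }
    return sc.Err()
}

func main() {
    const path = "book.txt"  // hypothetical input file
    const threshold = 1000   // only words whose bucket count exceeds this get exact counting

    // Pass 1: approximate counts in a fixed-size array of bucket counters.
    approx := make([]uint32, buckets)
    if err := scanWords(path, func(w string) { approx[bucket(w)]++ }); err != nil {
        panic(err)
    }

    // Pass 2: exact counts, but only for words whose bucket looked frequent.
    exact := make(map[string]int)
    if err := scanWords(path, func(w string) {
        if approx[bucket(w)] > threshold {
            exact[w]++
        }
    }); err != nil {
        panic(err)
    }

    for w, c := range exact {
        if c > threshold {
            fmt.Println(w, c)
        }
    }
}
Rare words that collide into a hot bucket may get exact-counted unnecessarily, but no genuinely frequent word can be missed, which is the point of the second pass.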
I'm a physicist, so my favourite approach is to approximate. You don't need to go through the entire text to get the most frequent words. Instead:
parse a chunk small enough to allow for your memory limitations,
skip a random amount of text,
repeat, combining accumulated results.
Stop when the list has satisfactorily converged.
If you use a memory-efficient algorithm for the smaller chunks (e.g. sorting) then you can get far faster performance than even the most efficient algorithm that reads every word.
Note: This does make the assumption that the most frequent words occur frequently throughout the text, not just in one place in the text. For English text, this assumption holds, because of the frequency of words like 'the' throughout. If you're worried about this requirement, require the algorithm to complete at least one pass of the entire text.
I'll probably get down-voted for this...
If the text is English and you just want to find the top 5 most frequent words, here is your program:
print "1. the\n";
print "2. of\n";
print "3. and\n";
print "4. a\n";
print "5. to\n";
Runs fast and consumes minimal memory!
If performance is really of no concern you could just go through each word in turn, check whether it's in your "top N" and, if it isn't, count all its occurrences. This way you're only storing N values. Of course, you'd be counting the same words many times, but, as you said, performance isn't an issue, and the code would be trivial (which is generally preferable, all other things being equal).
One way would be to sort the list first.
We can sort the words in place without a lot of memory (traded for slower performance).
And then we can have a simple counting loop that finds the words with maximum frequency without having to save everything in memory, since they're in sorted form.
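A small Go sketch of the sort-then-scan idea; the in-memory slice here is only for illustration, since the point of the approach is that the same single scan works over any sorted sequence of the words:
package main

import (
    "fmt"
    "sort"
)

// mostFrequent sorts the words so equal words are adjacent, then scans once,
// tracking the length of the current run of identical words.
func mostFrequent(words []string) (string, int) {
    sort.Strings(words) // in place; O(n log n) time, little extra memory
    bestWord, bestCount := "", 0
    runCount := 0
    for i, w := range words {
        if i == 0 || w != words[i-1] {
            runCount = 0 // a new run starts here
        }
        runCount++
        if runCount > bestCount {
            bestWord, bestCount = w, runCount
        }
    }
    return bestWord, bestCount
}

func main() {
    words := []string{"the", "cat", "sat", "on", "the", "mat", "the"}
    w, c := mostFrequent(words)
    fmt.Println(w, c) // "the" 3
}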
Do you mean a lot of process memory? If so, one way would be to use the disk as virtual memory (aka write a filesystem wrapper).
A possible solution is to use a trie data structure for storing all words associated to their number of occurrences.
Other solutions may be found in answers to this related question: Space-Efficient Data Structure for Storing a Word List?
Like many good interview questions, the question is phrased a little ambiguously/imprecisely, to force the interviewee to ask clarifying questions and state assumptions. I think a number of the other answers here are good, as they poke at these assumptions and demonstrate big-picture understanding.
I'm assuming the text is stored 'offline' somewhere, but there is a way to iterate over each word in the text without loading the whole text into memory.
Then the F# code below finds the top N words. Its only data structure is a mapping of key-value pairs (word, frequency), and it only keeps the top N of those, so the memory use is O(N), which is small. The runtime is O(numWordsInText^2), which is poor, but acceptable given the problem constraints. The gist of the algorithm is straightforward: for each word in the text, count how many times it occurs, and if it's in the running best-N, then add it to the list and remove the previous minimum entry.
Note that the actual program below loads the entire text into memory, merely for convenience of exposition.
#light
// some boilerplate to grab a big piece of text off the web for testing
open System.IO
open System.Net
let HttpGet (url: string) =
    let req = System.Net.WebRequest.Create(url)
    let resp = req.GetResponse()
    let stream = resp.GetResponseStream()
    let reader = new StreamReader(stream)
    let data = reader.ReadToEnd()
    resp.Close()
    data
let text = HttpGet "http://www-static.cc.gatech.edu/classes/cs2360_98_summer/hw1"
let words = text.Split([|' ';'\r';'\n'|], System.StringSplitOptions.RemoveEmptyEntries)
// perhaps 'words' isn't actually stored in memory, but so long as we can
// 'foreach' over all the words in the text we're good
let N = 5 // how many 'top frequency' words we want to find
let FindMin map =
    // key-value pair with minimum value in a map
    let (Some(seed)) = Map.first (fun k v -> Some(k,v)) map
    map |> Map.fold_left
        (fun (mk,mv) k v -> if v > mv then (mk,mv) else (k,v))
        seed
let Main() =
    let mutable freqCounts = Map.of_list [ ("",0) ]
    for word in words do
        let mutable count = 0
        for x in words do
            if x = word then
                count <- count + 1
        let minStr,minCount = FindMin freqCounts
        if count >= minCount then
            freqCounts <- Map.add word count freqCounts
            if Seq.length freqCounts > N then
                freqCounts <- Map.remove minStr freqCounts
    freqCounts
    |> Seq.sort_by (fun (KeyValue(k,v)) -> -v)
    |> Seq.iter (printfn "%A")
Main()
Output:
[the, 75]
[to, 41]
[in, 34]
[a, 32]
[of, 29]
You could use a combination of external merge sort and a priority queue. The merge sort will make sure that your memory limits are honored, and the priority queue will maintain your top K words. Obviously, the priority queue has to be small enough to fit into memory.
First, divide the input strings into chunks, sort each chunk, and store it in secondary storage (external sorting) - O(n log n).
Read each chunk and, within the chunk, calculate the frequency of words, so at the end of this step each chunk is reduced to (unique word, frequency count) pairs within the chunk. O(n)
Start reading elements across the chunks and aggregate the counts for each word. Since the chunks are sorted, you can do it in O(n).
Now maintain a min priority heap (the top of the heap is the minimum element in the heap) of K elements. Populate the heap with the first K elements; then, for each next (unique word, final count), if its count is greater than the top element in the heap, pop the top and push the current word. O(n log k)
So your final time complexity is O(n(log k + log n))
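As a partial illustration of this pipeline (the chunking and spilling steps only), here is a hedged Go sketch that reads words from stdin, sorts fixed-size chunks in memory, collapses duplicates, and spills each sorted run to a file; the chunk size and the run-file naming are assumptions, and the k-way merge plus top-K heap would follow as described above:
package main

import (
    "bufio"
    "fmt"
    "os"
    "sort"
)

// writeSortedRun sorts one in-memory chunk of words, collapses duplicates
// into (word, count) pairs, and spills them to a run file on disk.
func writeSortedRun(words []string, runIndex int) (string, error) {
    sort.Strings(words)
    name := fmt.Sprintf("run-%d.txt", runIndex)
    f, err := os.Create(name)
    if err != nil {
        return "", err
    }
    defer f.Close()
    w := bufio.NewWriter(f)
    defer w.Flush()
    for i := 0; i < len(words); {
        j := i
        for j < len(words) && words[j] == words[i] {
            j++
        }
        fmt.Fprintf(w, "%s %d\n", words[i], j-i)
        i = j
    }
    return name, nil
}

func main() {
    const chunkSize = 100000 // tune to the available memory

    var chunk []string
    var runs []string
    flush := func() {
        if len(chunk) == 0 {
            return
        }
        name, err := writeSortedRun(chunk, len(runs))
        if err != nil {
            panic(err)
        }
        runs = append(runs, name)
        chunk = chunk[:0]
    }

    scanner := bufio.NewScanner(os.Stdin)
    scanner.Split(bufio.ScanWords)
    for scanner.Scan() {
        chunk = append(chunk, scanner.Text())
        if len(chunk) == chunkSize {
            flush()
        }
    }
    flush()

    fmt.Println("sorted runs:", runs)
    // A k-way merge of these runs (e.g. with container/heap), summing counts as
    // equal words meet, then feeds the top-K min-heap described above.
}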
Well, if you want absolutely terrible performance...
Take the first word in the book, and count how many times it occurs. Take the second word in the book, count how many times it occurs. If it's more than the last word, discard the last word. And so forth... you'll end up counting the same words multiple times unless you keep a list of them somewhere, but if you really want to minimize memory, this should only require a few ints. Should run in O(n^2) time, where n is the number of words in the book.
How about creating a binary tree of word keys (as you keep reading the words from the file)? That lets you check whether a word has already been seen in O(log n), so overall you get O(n log n) for the top-word search.
The basic algorithm would be:
for each word in a file:
Create a unique key for the given word (a weighted ASCII sum, e.g. "bat" could be 1*'b' + 2*'a' + 3*'t');
Add this word to the tree. If the word already exists, increment its count.
Feed the word and the current count to maintainTop5(word, count). maintainTop5() maintains a dynamic list of the top 5 counts and their associated words.
At the end of the file you have the top 5 words.
