Hashmaps and keeping time complexity low

I have a list of words, and a lot of posts. I have to check whether any of those words are present in a post; if so, save it, then go to the next post, and so on.
I can turn the list of n words into a hashmap so lookups are O(1), but I still have to go through all the words in a single post [O(n)] and then through all the posts [O(n^2)].
Any suggestion to improve this?
Usually I wouldn't bother, but the number of posts is massive.

That's mostly right. You can't beat O(n + m), where n is the number of words across all of the posts and m is the number of target words. We assume O(1) to look up each word in the set/hash of target words and O(m) to build that set.
But O(n^2) doesn't look right, because the total size of the posts isn't quadratic and there's no relationship between the number of words in a post and the number of posts.
You could call it O(num_target_words + num_posts * max(num_words_in_post)), but this seems like an awkward way to characterize a problem that mostly depends on words, not on posts. So just O(n + m) seems clearest.
If m is constant and/or the set construction is guaranteed to be cheap, we can disregard it and just call it O(n). But maybe you have a huge bucket of potential words and just a few posts, in which case m dominates.
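Here is a minimal sketch of that O(n + m) approach in Python (my own illustration; the sample posts and target words are made up): build the set of target words once, then stream each post and test its words against the set.

def posts_containing_targets(posts, target_words):
    targets = set(target_words)               # O(m) one-time cost to build the set
    matched = []
    for post in posts:                        # n = total words across all posts
        # any() short-circuits, so a post is saved as soon as one target word is seen
        if any(word in targets for word in post.split()):
            matched.append(post)
    return matched

posts = ["the quick brown fox", "lorem ipsum dolor", "hello world"]
print(posts_containing_targets(posts, ["fox", "world"]))
# -> ['the quick brown fox', 'hello world']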

Related

Finding the most similar string among a set of millions of strings

Let's say I have a dictionary (word list) of millions upon millions of words. Given a query word, I want to find the word from that huge list that is most similar.
So let's say my query is elepant, then the result would most likely be elephant.
If my word is fentist, the result will probably be dentist.
Of course assuming both elephant and dentist are present in my initial word list.
What kind of index, data structure or algorithm can I use for this so that the query is fast? Hopefully complexity of O(log N).
What I have: The most naive thing to do is to create a "distance function" (which computes the "distance" between two words, in terms of how different they are) and then in O(n) compare the query with every word in the list, and return the one with the closest distance. But I wouldn't use this because it's slow.
The problem you're describing is a Nearest Neighbor Search (NNS). There are two main methods of solving NNS problems: exact and approximate.
If you need an exact solution, I would recommend a metric tree, such as the M-tree, the MVP-tree, or the BK-tree. These trees take advantage of the triangle inequality to speed up search.
If you're willing to accept an approximate solution, there are much faster algorithms. The current state of the art for approximate methods is Hierarchical Navigable Small World (HNSW). The Non-Metric Space Library (nmslib) provides an efficient implementation of HNSW as well as several other approximate NNS methods.
(You can compute the Levenshtein distance with Hirschberg's algorithm)
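To make the metric-tree suggestion concrete, here is a rough BK-tree sketch in Python (my own illustration with made-up names, not code from any library mentioned above). It uses Levenshtein distance as the metric and relies on the triangle inequality to prune subtrees whose edge label lies outside [d - tol, d + tol].

def levenshtein(a, b):
    # classic O(len(a) * len(b)) dynamic program
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

class BKTree:
    def __init__(self, words):
        it = iter(words)
        self.root = [next(it), {}]            # node = [word, {distance: child}]
        for w in it:
            self.add(w)

    def add(self, word):
        node = self.root
        while True:
            d = levenshtein(word, node[0])
            if d == 0:
                return                        # word already present
            child = node[1].get(d)
            if child is None:
                node[1][d] = [word, {}]
                return
            node = child

    def search(self, query, tol):
        # return all (distance, word) pairs within edit distance tol of the query
        out, stack = [], [self.root]
        while stack:
            word, children = stack.pop()
            d = levenshtein(query, word)
            if d <= tol:
                out.append((d, word))
            for edge, child in children.items():
                if d - tol <= edge <= d + tol:   # triangle-inequality pruning
                    stack.append(child)
        return sorted(out)

tree = BKTree(["elephant", "dentist", "relevant", "detest"])
print(tree.search("elepant", 2))   # elephant comes back at distance 1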
I made a similar algorithm some time ago.
The idea is to have an array char[255] indexed by character,
where each value is a list of word hashes (word ids) of the words that contain that character.
When you are searching for 'dele....':
search(d) will return an empty list
search(e) will find everything with the character e, including elephant (twice, as it has two 'e's)
search(l) will bring you a new list, and you need to combine this list with the results from the previous step
...
at the end of the input you will have a list
then you can group by wordHash and order descending by count
Another interesting thing: if your input is missing one or more characters, you will just receive an empty list in the middle of the search, and it will not affect the idea.
My initial algorithm was without ordering, and I was storing, for every character, the wordId, lineNumber and char position.
My main problem was that I wanted to search
with ee to find 'elephant'
with eleant to find 'elephant'
with antph to find 'elephant'
Every word was actually a line from a file, so it was often very long,
and the number of files and lines was big.
I wanted a quick search for directories with more than 1 GB of text files,
so it was a problem even to store them in memory. For this idea you need 3 parts:
a function to fill your cache
a function to find by char from the input
a function to filter and maybe order the results (I didn't use ordering, as I was trying to fill my cache in the same order as I read the file, and I wanted lines that contain the input to come out in that same order)
I hope it makes sense.
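A rough Python sketch of the character-index idea described above (assumed details; for simplicity it counts each query character once per word rather than tracking positions or multiplicity): map each character to the set of word ids containing it, then rank candidates by how many query characters they hit.

from collections import Counter, defaultdict

def build_index(words):
    index = defaultdict(set)            # char -> set of word ids containing it
    for wid, word in enumerate(words):
        for ch in word:
            index[ch].add(wid)
    return index

def search(index, words, query, top=3):
    hits = Counter()                    # word id -> number of matching query chars
    for ch in query:
        for wid in index.get(ch, ()):   # a missing char simply contributes nothing
            hits[wid] += 1
    return [(words[wid], n) for wid, n in hits.most_common(top)]

words = ["elephant", "dentist", "antelope"]
index = build_index(words)
print(search(index, words, "eleant"))   # 'elephant' ranks at or near the top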

Is Binary Search O(log n) or O(n log n)?

I have encountered times when it was O(log n), and times when it was O(n log n). This is also a very popular interview question. So when the interviewer blindly asks you, with no context, what the running time of a binary search is, what should you say?
Sounds like a trick question, since there is no context. It looks like the interviewer wants to cover both the cases where binary search is good and the cases where it's not.
Binary search is great when you have a sorted list of elements and you search for a single element; in that case it costs O(log n).
Now, if the array isn't sorted, the cost of sorting it is O(n log n), and then you can apply the first case. In that situation it's better to place the values in a set or map and then search (O(n) to insert everything, O(1) per search).
Both of these cases rest on a single search. Binary search is not meant for searching for n elements in a single execution (or any number of elements that depends on n, like n/2, n/4, or even log n elements; for a fixed number it's fine). For such cases there are better tools (sets and maps).
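A minimal Python sketch of the two cases above (the function names are mine): a single binary search on already-sorted data is O(log n), while having to sort first makes the O(n log n) sort dominate.

import bisect

def contains_sorted(sorted_list, x):
    # O(log n): one binary search on data that is already sorted
    i = bisect.bisect_left(sorted_list, x)
    return i < len(sorted_list) and sorted_list[i] == x

def contains_unsorted(unsorted_list, x):
    # O(n log n) overall: the sort dominates the single O(log n) search
    return contains_sorted(sorted(unsorted_list), x)

data = [7, 3, 9, 1, 5]
print(contains_sorted(sorted(data), 5))   # True
print(contains_unsorted(data, 4))         # False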
O(log n), for average and worst case. Never heard someone claim it is O(n log n).

Looking for ideas: efficiently computing an LCP array over a lexicographically sorted suffix array built from many different strings

I don't want a direct solution to the problem that is the source of this question, but for reference it is this one: link
So I take in the strings and add them to a suffix array, which is implemented as a sorted set internally; what I obtain is a lexicographically sorted list of the suffixes of the two given strings.
S1 = "banana"
S2 = "panama"
SuffixArray.add S1, S2
To make searching for the k-th smallest substring efficient, I preprocess this sorted set to add information about the longest common prefix between each suffix and its predecessor, as well as keeping a cumulative substring count. So I know that, for a given k greater than the cumulative substring count of the last item, it's an invalid query.
This works really well for small inputs as well as random large inputs within the constraints given in the problem definition, which is at most 50 strings of length 2000. I am able to pass 4 out of the 7 cases and was pretty surprised I didn't get them all.
So I went searching for the bottleneck, and it hit me. Given inputs like these:
anananananananana.....ananana
bkbkbkbkbkbkbkbkb.....bkbkbkb
The queries for the k-th smallest substrings are still fast as expected, but not the way I preprocess the sorted set... The way I calculate the longest common prefix between the elements of the set is not efficient: it is linear, O(m). I did the most naïve thing, expecting it to be good enough:
m = anananan
n = anananana
Start at 0 and find the point where `m[i] != n[i]`
It is like this because a suffix and its predecessor might not be related (i.e. they may come from different input strings), so I thought I couldn't help but use brute force.
Here is the question, then, and the form to which I ended up reducing the problem. Given a list of lexicographically sorted suffixes as described above (drawn from multiple strings):
What is an efficient way of computing the longest common prefix array?
The sub-question would then be: am I completely off the mark in my approach? Please propose further avenues of investigation if that's the case.
Footnote: I do not want to be shown an implemented algorithm, and I don't mind being told to go read such-and-such book or resource on the subject, as that is what I do anyway while attempting these challenges.
The accepted answer will be something that guides me onto the right path or, failing that, something that teaches me how to solve this type of problem in a broader sense: a book or similar.
READING
I would recommend this tutorial pdf from Stanford.
This tutorial explains a simple O(n log^2 n) algorithm with O(n log n) space to compute the suffix array and a matrix of intermediate results. The matrix of intermediate results can be used to compute the longest common prefix between two suffixes in O(log n).
HINTS
If you wish to try to develop the algorithm yourself, the key is to sort the strings based on their 2^k long prefixes.
From the tutorial:
Let A(i,k) denote the subsequence of A of length 2^k starting at position i.
The position of A(i,k) in the sorted array of A(j,k) subsequences (j=1,n) is kept in P(k,i).
and
Using matrix P, one can iterate, descending from the biggest k down to 0, and check whether A(i,k) = A(j,k). If the two prefixes are equal, a common prefix of length 2^k has been found. All that is left is to update i and j, increasing them both by 2^k, and check again whether there are any more common prefixes.
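A compact Python sketch of that prefix-doubling idea (my own illustration, not the tutorial's code; here P[k][i] is the rank of the suffix starting at i when suffixes are compared by their first 2^k characters):

def build_rank_table(s):
    n = len(s)
    P = [[ord(c) for c in s]]         # ranks by the first 2^0 = 1 characters
    length = 1
    while length < n:
        prev = P[-1]
        # rank each suffix by the pair (rank of first half, rank of second half)
        key = lambda i: (prev[i], prev[i + length] if i + length < n else -1)
        order = sorted(range(n), key=key)
        rank = [0] * n
        for idx in range(1, n):
            rank[order[idx]] = rank[order[idx - 1]] + (key(order[idx]) != key(order[idx - 1]))
        P.append(rank)
        length *= 2
    return P                          # O(n log^2 n) time, O(n log n) space

def lcp(P, i, j, n):
    # longest common prefix of the suffixes starting at i and j, in O(log n)
    if i == j:
        return n - i
    result = 0
    for k in range(len(P) - 1, -1, -1):
        if i < n and j < n and P[k][i] == P[k][j]:
            result += 1 << k          # the first 2^k characters match; jump past them
            i += 1 << k
            j += 1 << k
    return result

s = "banana"
P = build_rank_table(s)
print(lcp(P, 1, 3, len(s)))           # "anana" vs "ana" share "ana" -> 3

For the multi-string setting in the question, one common trick is to concatenate the strings with distinct separator characters that do not occur in the inputs, so that no common prefix can run across a string boundary.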

An O(n^2) (or O(n^2 lg n)?) algorithm to calculate the longest common subsequence (LCS) of two 'ring' strings

This is a problem that appeared in today's Pacific NW Region Programming Contest, during which no one solved it. It is problem B, and the complete problem set is here: http://www.acmicpc-pacnw.org/icpc-statements-2011.zip. There is a well-known O(n^2) algorithm for the LCS of two strings using dynamic programming. But when these strings are extended to rings, I have no idea...
P.S. Note that it is subsequence rather than substring, so the elements do not need to be adjacent to each other.
P.S. It might not be O(n^2) but O(n^2 lg n), or anything that can give the result within 5 seconds on a common computer.
Searching the web, this appears to be covered by section 4.3 of the paper "Incremental String Comparison", by Landau, Myers, and Schmidt at cost O(ne) < O(n^2), where I think e is the edit distance. This paper also references a previous paper by Maes giving cost O(mn log m) with more general edit costs - "On a cyclic string to string correcting problem". Expecting a contestant to reproduce either of these papers seems pretty demanding to me - but as far as I can see the question does ask for the longest common subsequence on cyclic strings.
You can double the first and second string and then use the ordinary method, and later wrap the positions around.
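A brute-force Python sketch of this doubling idea (my own O(N^3) illustration, not either answer's exact method): a cyclic common subsequence can always be cut so that the second string is read linearly, so it suffices to run the standard O(N^2) LCS DP against every rotation of the first string, read out of the doubled string.

def lcs(a, b):
    # classic O(len(a) * len(b)) dynamic program for the (linear) LCS length
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, 1):
        for j, cb in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if ca == cb else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)]

def clcs_bruteforce(a, b):
    doubled = a + a                   # every rotation of a is a window of a+a
    n = len(a)
    return max(lcs(doubled[i:i+n], b) for i in range(n))

print(clcs_bruteforce("aabb", "bbaa"))   # 4, while the plain linear LCS is only 2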
It is a good idea to "double" the strings and apply the standard dynamic programming algorithm. The problem with it is that, to get the optimal cyclic LCS, one then has to "start the algorithm from multiple initial conditions". Just one initial condition (e.g. setting all Lij variables to 0 at the boundaries) will not do in general. In practice it turns out that the number of initial states needed is O(N) (they span a diagonal), so one gets back to an O(N^3) algorithm.
However, the approach does have some virtue, as it can be used to design efficient O(N^2) heuristics (not exact but near exact) for CLCS.
I do not know if a true O(N^2) algorithm exists, and I would be very interested if someone knows one.
The CLCS problem has quite interesting "periodicity" properties: the length of the CLCS of
p-times repeated strings is p times the CLCS of the strings. This can be proved by adopting a geometric view of the problem.
Also, there are some additional benefits of the problem: it can be shown that if Lc(N) denotes the averaged value of the CLCS length of two random strings of length N, then
|Lc(N) - CN| is O(sqrt(N)), where C is the Chvátal-Sankoff constant. For the averaged length L(N) of the standard LCS, the only rate result I know of says that |L(N) - CN| is O(sqrt(N log N)). There could be a nice way to compare Lc(N) with L(N), but I don't know it.
Another question: it is clear that the CLCS length is not superadditive, contrary to the LCS length. By this I mean it is not true that CLCS(X1X2, Y1Y2) is always greater than CLCS(X1, Y1) + CLCS(X2, Y2) (it is very easy to find counterexamples with a computer).
But it seems possible that the averaged length Lc(N) is superadditive (Lc(N1+N2) greater than Lc(N1) + Lc(N2)), though if there is a proof I don't know it.
One modest interest of this question is that the values Lc(N)/N for the first few values of N would then provide good bounds on the Chvátal-Sankoff constant (much better than L(N)/N).
As a followup to mcdowella's answer, I'd like to point out that the O(n^2 lg n) solution presented in Maes' paper is the intended solution to the contest problem (check http://www.acmicpc-pacnw.org/ProblemSet/2011/solutions.zip). The O(ne) solution in Landau et al's paper does NOT apply to this problem, as that paper is targeted at edit distance, not LCS. In particular, the solution to cyclic edit distance only applies if the edit operations (add, delete, replace) all have unit (1, 1, 1) cost. LCS, on the other hand, is equivalent to edit distances with (add, delete, replace) costs (1, 1, 2). These are not equivalent to each other; for example, consider the input strings "ABC" and "CXY" (for the acyclic case; you can construct cyclic counterexamples similarly). The LCS of the two strings is "C", but the minimum unit-cost edit is to replace each character in turn.
At 110 lines but no complex data structures, Maes' solution falls towards the upper end of what is reasonable to implement in a contest setting. Even if Landau et al's solution could be adapted to handle cyclic LCS, the complexity of the data structure makes it infeasible in a contest setting.
Last but not least, I'd like to point out that an O(n^2) solution DOES exist for CLCS, described here: http://arxiv.org/abs/1208.0396. At 60 lines, with no complex data structures and only 2 arrays, this solution is quite reasonable to implement in a contest setting. Arriving at the solution might be a different matter, though.

How to find high frequency words in a book in an environment low on memory?

Recently, in a technical interview, I was asked to write a program to find the high-frequency words (the words that appear the maximum number of times) in a text book. The program should be designed in such a way that it processes the entire text book with minimum memory. Performance is not a concern. I was able to write a program to find the frequency of words, but it consumed a lot of memory.
How do you make this operation less memory intensive? Any strategies/solutions?
-Snehal
You probably used hash tables which are memory-intensive but have a constant-lookup time--so the performance/memory trade off is obvious. By the time you reach the end of the book you will know your answer. Also, incrementing counters for each word is fast (because of the quick hashtable lookups).
The other end of the spectrum is to look at the first word, then go through the entire book to see how many times that word occurs. This requires minimal memory. Then you do the same for the next word and go through the entire book. If that word occurs more times, you add that as the top word (or top N words). Of course, this is extremely inefficient--if the first and third word are the same you'll end up going through the whole book again even though you just did the same thing for the first word.
OK, if you're only interested in the highest n occurring words, one way to do it is in two passes, with the first pass based on a modified Bloom Filter. Instead of using a bit map to track hash occurrences, use an integer array instead - either byte, 16 bit, 32 bit or even 64 bit depending on your input size. Where a Bloom filter simply sets the bit corresponding to each of the hash values of a word, you'll increment the count at the hash index in the array.
The problem with this approach is that two words will probably give the same hash values. So you need to do a second pass where you ignore words unless their hash totals are above a certain threshold, thus reducing the amount of memory you need to allocate to do accurate counting.
So just create a bit map with bits set for the highest occurring hash values. Then in the second pass of the words, if a word has "hits" in the bitmap for its hashes, look it up or add it to a hash table and increment its count. This minimises memory usage by creating a hash table of only the highest occurring words.
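A rough Python sketch of that two-pass scheme (assumed details: a single hash function and a fixed array size are my simplifications of the "modified Bloom filter"): pass 1 counts hash buckets in a fixed-size integer array; pass 2 counts exactly, but only for words whose bucket total clears a threshold.

from collections import Counter

NUM_BUCKETS = 1 << 20        # fixed memory, independent of vocabulary size

def bucket(word):
    return hash(word) % NUM_BUCKETS

def top_words(word_iter_factory, threshold, top_n=5):
    # pass 1: approximate counts per hash bucket (collisions can only over-count)
    buckets = [0] * NUM_BUCKETS
    for w in word_iter_factory():
        buckets[bucket(w)] += 1
    # pass 2: exact counts, restricted to words whose bucket cleared the threshold
    exact = Counter()
    for w in word_iter_factory():
        if buckets[bucket(w)] >= threshold:
            exact[w] += 1
    return exact.most_common(top_n)

# word_iter_factory must re-stream the book from disk on each call, e.g.:
# top_words(lambda: (w for line in open("book.txt") for w in line.split()), threshold=1000)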
I'm a physicist, so my favourite approach is to approximate. You don't need to go through the entire text to get the most frequent words. Instead:
parse a chunk small enough to allow for your memory limitations,
skip a random amount of text,
repeat, combining accumulated results.
Stop when the list has satisfactorily converged.
If you use a memory-efficient algorithm for the smaller chunks (e.g. sorting) then you can get far faster performance than even the most efficient algorithm that reads every word.
Note: this does make the assumption that the most frequent words occur frequently throughout the text, not just at one place in the text. For English text this assumption holds, because of the frequency of words like 'the' throughout. If you're worried about this requirement, require the algorithm to complete at least one pass of the entire text.
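A toy Python sketch of that sampling approach (the chunk size, skip range, and fixed round count are arbitrary choices of mine; a fixed number of rounds stands in for the convergence check): count words in small chunks and skip a random stretch of text between chunks.

import random
from collections import Counter

def approx_top_words(word_stream, chunk_words=10_000, max_skip=50_000, rounds=20, top_n=5):
    counts = Counter()
    for _ in range(rounds):
        # parse a chunk small enough for the memory budget
        counts.update(w for w, _ in zip(word_stream, range(chunk_words)))
        # skip a random amount of text before the next chunk
        for _ in zip(word_stream, range(random.randrange(1, max_skip))):
            pass
    return counts.most_common(top_n)

# with open("book.txt") as f:
#     words = (w for line in f for w in line.split())
#     print(approx_top_words(words))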
I'll probably get down-voted for this...
If the text is English and you just want to find the top 5 most frequent words, here is your program:
print "1. the\n";
print "2. of\n";
print "3. and\n";
print "4. a\n";
print "5. to\n";
Runs fast and consumes minimal memory!
If performance is really of no concern, you could just go through each word in turn, check if it's in your "top N" and, if it isn't, count all its occurrences. This way you're only storing N values. Of course, you'd be counting the same words many times, but, as you said, performance isn't an issue - and the code would be trivial (which is generally preferable - all other things being equal).
One way would be to sort the list first.
We can sort the words in place without a lot of memory (traded for slow performance).
And then we can have a simple counting loop that finds the words with maximum frequency without having to save everything in memory, since equal words are adjacent in sorted form.
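A minimal Python sketch of that counting loop (assuming the words are already sorted, whether in place or via an external sort): equal words are adjacent, so one linear pass with a couple of variables finds the most frequent word without a big hash table.

def most_frequent_sorted(sorted_words):
    best_word, best_count = None, 0
    current_word, current_count = None, 0
    for w in sorted_words:
        if w == current_word:
            current_count += 1
        else:
            current_word, current_count = w, 1
        if current_count > best_count:
            best_word, best_count = current_word, current_count
    return best_word, best_count

print(most_frequent_sorted(sorted("the cat and the dog and the bird".split())))
# -> ('the', 3)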
Do you mean a lot of process memory? If so, one way would be to use the disk as virtual memory (aka write a filesystem wrapper).
A possible solution is to use a trie data structure for storing all words associated to their number of occurrences.
Other solutions may be found in answers to this related question: Space-Efficient Data Structure for Storing a Word List?
Like many good interview questions, the question is phrased a little ambiguously/imprecisely, to force the interviewee to ask clarifying questions and state assumptions. I think a number of the other answers here are good, as they poke at these assumptions and demonstrate big-picture understanding.
I'm assuming the text is stored 'offline' somewhere, but there is a way to iterate over each word in the text without loading the whole text into memory.
Then the F# code below finds the top N words. Its only data structure is a mapping of key-value pairs (word, frequency), and it only keeps the top N of those, so the memory use is O(N), which is small. The runtime is O(numWordsInText^2), which is poor, but acceptable given the problem constraints. The gist of the algorithm is straightforward: for each word in the text, count how many times it occurs, and if it's in the running best-N, add it to the list and remove the previous minimum entry.
Note that the actual program below loads the entire text into memory, merely for convenience of exposition.
#light
// some boilerplate to grab a big piece of text off the web for testing
open System.IO
open System.Net
let HttpGet (url: string) =
    let req = System.Net.WebRequest.Create(url)
    let resp = req.GetResponse()
    let stream = resp.GetResponseStream()
    let reader = new StreamReader(stream)
    let data = reader.ReadToEnd()
    resp.Close()
    data
let text = HttpGet "http://www-static.cc.gatech.edu/classes/cs2360_98_summer/hw1"
let words = text.Split([|' ';'\r';'\n'|], System.StringSplitOptions.RemoveEmptyEntries)
// perhaps 'words' isn't actually stored in memory, but so long as we can
// 'foreach' over all the words in the text we're good
let N = 5  // how many 'top frequency' words we want to find
let FindMin map =
    // key-value pair with minimum value in a map
    let (Some(seed)) = Map.first (fun k v -> Some(k,v)) map
    map |> Map.fold_left
        (fun (mk,mv) k v -> if v > mv then (mk,mv) else (k,v))
        seed
let Main() =
    let mutable freqCounts = Map.of_list [ ("",0) ]
    for word in words do
        let mutable count = 0
        for x in words do
            if x = word then
                count <- count + 1
        let minStr,minCount = FindMin freqCounts
        if count >= minCount then
            freqCounts <- Map.add word count freqCounts
            if Seq.length freqCounts > N then
                freqCounts <- Map.remove minStr freqCounts
    freqCounts
    |> Seq.sort_by (fun (KeyValue(k,v)) -> -v)
    |> Seq.iter (printfn "%A")
Main()
Output:
[the, 75]
[to, 41]
[in, 34]
[a, 32]
[of, 29]
You could use a combination of external merge sort and a priority queue. Merge sort will make sure that your memory limits are honored, and the priority queue will maintain your top K words. Obviously, the priority queue has to be small enough to fit into memory.
First, divide the input strings into chunks, sort each chunk and store it in secondary storage (external sorting): O(n log n).
Read each chunk and, within the chunk, calculate the frequency of words, so at the end of this step each chunk is reduced to (unique word, frequency count) pairs within the chunk: O(n).
Start reading elements across the chunks and aggregate the counts for each word. Since the chunks are sorted, you can do it in O(n).
Now, maintain a min priority heap (the top of the heap is the minimum element in the heap) of K elements. Populate the priority heap with the first K elements; then, for each subsequent (unique word, final count), if its count is greater than the top element in the heap, pop the top and push the current word: O(n log k). A sketch of these last two steps is below.
So your final time complexity is O(n(log k + log n)).
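A Python sketch of the aggregation and heap steps (it assumes the sorted per-chunk (word, count) streams from the earlier steps already exist; the names are mine): merge the sorted chunk streams, total the counts per word, and keep only K entries in a min-heap.

import heapq
from itertools import groupby
from operator import itemgetter

def top_k(sorted_chunk_streams, k):
    merged = heapq.merge(*sorted_chunk_streams)        # still sorted by word
    heap = []                                          # min-heap of (count, word), at most k entries
    for word, group in groupby(merged, key=itemgetter(0)):
        total = sum(count for _, count in group)       # aggregate the word across chunks
        if len(heap) < k:
            heapq.heappush(heap, (total, word))
        elif total > heap[0][0]:
            heapq.heapreplace(heap, (total, word))
    return sorted(heap, reverse=True)

chunk_a = [("and", 2), ("cat", 1), ("the", 3)]         # pretend these came from disk
chunk_b = [("and", 1), ("dog", 2), ("the", 4)]
print(top_k([iter(chunk_a), iter(chunk_b)], 2))        # -> [(7, 'the'), (3, 'and')]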
Well, if you want absolutely terrible performance...
Take the first word in the book, and count how many times it occurs. Then take the second word in the book and count how many times it occurs; if it occurs more often than the previously kept word, discard that one. And so forth... you'll end up counting the same words multiple times unless you keep a list of them somewhere, but if you really want to minimize memory, this should only require a few ints. It should run in O(n^2) time, where n is the number of words in the book.
How about creating a binary tree of word keys (as you keep reading the words from the file)? This helps to find already-seen words in O(log n), so you end up with O(n log n) for the top-word search.
The basic algo would be:
for each word in the file:
Create a unique key for the given word (e.g. a weighted ASCII sum, so "bat" could be 1*'b' + 2*'a' + 3*'t');
Add this word to the tree. If the word already exists, increment its count.
Feed the word and the current count to maintainTop5(word, count). maintainTop5() maintains a dynamic list of the top-5 counts and their associated words.
At the end of the file you have the top 5 words.

Resources