How to find the twenty most frequently occurring words in a given file as efficiently as possible in Python - hashmap

Python data structures should not be used
Ideally this would be O(n) in both time and space, but since there may be a lot of words we cannot assume that everything can be stored in memory. The worst-case time complexity should be reduced; at present it is O(N^4).
The number of words to display and the display order should be provided as configurable options.
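For reference, here is a minimal sketch of the hashmap-plus-heap approach using Python built-ins (which the constraint above on Python data structures may rule out); the file name, the word count k, and the descending flag are hypothetical parameters. Counting is O(n) and selecting the top k is O(n log k), but the counter still keeps every distinct word in memory, so this does not by itself address the memory concern.

import heapq
import re
from collections import Counter

def top_words(path, k=20, descending=True):
    # Count words in one streaming pass over the file: O(n) time.
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            counts.update(re.findall(r"[a-z']+", line.lower()))
    # Pick the k most frequent entries with a bounded heap: O(n log k).
    top = heapq.nlargest(k, counts.items(), key=lambda kv: kv[1])
    return top if descending else top[::-1]

# Example (hypothetical file name):
# print(top_words("big.txt", k=20))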

Related

Comparing big texts in Python

I'm not that good with math, so I'm posting my question here. I hope I will not get tons of dislikes.
I have a lot of big texts, from 200,000 to 1,000,000 chars each, and I need to compare the texts to find duplicates. I decided to use a fingerprint (MD5 hash) and then compare the fingerprints. But then I thought of a new way of comparing: counting the chars in each text.
So which one will work faster, and which one will use less CPU power?
P.S. IMPORTANT: there CANNOT be 2 or more different texts with the same char count
Taking the length of the string will be a lot faster and use less CPU power.
This is because it is only one task, it is easy for Python, and it has the benefit of being a built-in operation.
However, to compute an MD5 hash, it needs to do calculations on each character to produce the overall hash, which will take a lot longer.
If the texts are exact duplicates you can take the hashes, or even faster, the lengths of the texts, and sort them (paired with the id of the text or a reference to the text itself), identifying repeated lengths (or hashes).
For sorting you can use a fast sorting algorithm, for example quicksort.
In fact there is even a special *nix command-line utility for sorting items with duplicate removal: sort -u.
If the texts are near duplicates rather than exact ones, things get harder: you need a duplication-aware hashing algorithm and you need to sort the resulting hashes by a similarity metric, so that two compared items count as near duplicates if the distance between them is less than some similarity threshold.
Then pass over the resulting sorted list again and collect the near duplicates.
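As a rough sketch of the exact-duplicate case described above (combining the two ideas this way, and the function name, are my assumptions): bucket texts by length first, and only hash texts whose lengths collide.

import hashlib
from collections import defaultdict

def find_exact_duplicates(texts):
    # Bucket by length first: texts with different lengths can never be duplicates.
    by_length = defaultdict(list)
    for idx, text in enumerate(texts):
        by_length[len(text)].append(idx)
    # Only hash texts whose lengths collide, then group by the MD5 fingerprint.
    duplicates = []
    for ids in by_length.values():
        if len(ids) < 2:
            continue
        by_hash = defaultdict(list)
        for idx in ids:
            fingerprint = hashlib.md5(texts[idx].encode("utf-8")).hexdigest()
            by_hash[fingerprint].append(idx)
        duplicates.extend(group for group in by_hash.values() if len(group) > 1)
    return duplicates

# find_exact_duplicates(["abc", "abd", "abc"]) -> [[0, 2]]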

Efficient sampling of discrete random variable [duplicate]

I have a list of US names and their respective probabilities from the US census website. I would like to generate a random name from this list using the given probability. The data is here: US Census data
I have seen algorithms like the roulette wheel selection algorithm that are easy to implement, but I wanted to know if there was any way to generate random names in O(1). For histogram data this is easier, as you could create a hash of integers to birthdays, but I would like to do this for a continuous distribution.
If this is not possible, are there any python modules that take in probability distributions and generate random values based on those distributions?
There is an O(1)-time method. See this detailed description of Vose's "alias" method. Unfortunately, it suffers from a high initialization cost. For comparative timings of simpler methods, see Eli Bendersky's blog post. More timings can be found in this discussion from the Python issue tracker.
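For illustration, a minimal sketch of Vose's alias method (the function names and example weights below are mine): setup is O(n), and each draw afterwards is O(1).

import random

def build_alias_table(weights):
    # Normalize and scale so the average bucket value is exactly 1.
    total = sum(weights)
    n = len(weights)
    scaled = [w * n / total for w in weights]
    prob = [0.0] * n
    alias = [0] * n
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s] = scaled[s]          # bucket s keeps this much of its own mass...
        alias[s] = l                 # ...and borrows the rest from bucket l
        scaled[l] = scaled[l] + scaled[s] - 1.0
        (small if scaled[l] < 1.0 else large).append(l)
    for i in small + large:          # leftovers are exactly full (up to rounding)
        prob[i] = 1.0
    return prob, alias

def sample(prob, alias):
    # O(1): pick a bucket uniformly, then a biased coin decides bucket vs alias.
    i = random.randrange(len(prob))
    return i if random.random() < prob[i] else alias[i]

# Hypothetical example: names with census-style weights.
names = ["SMITH", "JOHNSON", "WILLIAMS"]
weights = [1.006, 0.810, 0.699]
prob, alias = build_alias_table(weights)
print(names[sample(prob, alias)])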
These days it's practical to enumerate the entire US population (~317 million) if you really need O(1) lookup. Just pick a number up to 317 million and get the name from there. (317000000*4 bytes = 1.268GB)
I think there are lots of O(log n) ways. Is there a particular reason you need O(1)? (They will use a lot less memory.)

NLP - Improving Running Time and Recall of Fuzzy string matching

I have made a working algorithm, but the running time is horrible. Yes, I knew from the start that it would be slow, but not this slow. For just 200,000 records, the program runs for more than an hour.
Basically what I am doing is:
for each searchfield in search fields
    for each sample in samples
        do a q-gram matching
    if there are matches then return it
    else
        split the searchfield into uniwords
        for each sample in samples
            split sample into uniwords
            for each uniword in samples
                if the uniword is a known abbreviation
                    then search the dictionary for its full word or other known abbr
                else do a jaro-winkler matching
            average the distances of all the uniwords
            if the average is above threshold then make it as a match and break
        end for
        if there is a match make a comment that it matched one of the samples partially
    end else
end for
Yes, this code is very loop-happy. I am using brute force because recall is very important. So I'm wondering how I can make it faster, since I will be running it not just on 200,000 records but on millions of records, and the client's computers are not high-end (a Pentium 4 or Dual Core with 1-2 GB of RAM; the computer where I test this program is a Dual Core with 4 GB of RAM). I came across TF/IDF, but I do not know if it will be sufficient. And I wonder how Google can make searches real-time.
Thanks in advance!
Edit:
This program is a data filterer. From 200,000 dummy records (the actual data is about 12M), I must filter out data that is irrelevant to the samples (500 dummy samples; I still do not know the actual number of samples).
With the given dummy data and samples, the running time was about 1 hour, but after tinkering here and there I have successfully reduced it to 10-15 minutes. I reduced it by grouping the fields and samples that begin with the same character (discounting special and non-meaningful words, e.g. the, a, an) and matching the fields only against the samples with the same first character. I know there is a problem there: what if the field is misspelled at the first character? But I think the number of those is negligible. The samples are spelled correctly since they are always maintained.
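A minimal sketch of that blocking step (the stopword list and the function names here are illustrative assumptions): group samples by the first character of their first meaningful word, and only compare a field against the samples in its own bucket.

from collections import defaultdict

STOPWORDS = {"the", "a", "an"}

def blocking_key(text):
    # First character of the first non-stopword token, or "" if none.
    for token in text.lower().split():
        if token not in STOPWORDS:
            return token[0]
    return ""

def candidate_pairs(fields, samples):
    # Bucket the samples so each field is compared only against
    # samples that share its blocking key.
    buckets = defaultdict(list)
    for sample in samples:
        buckets[blocking_key(sample)].append(sample)
    for field in fields:
        for sample in buckets.get(blocking_key(field), []):
            yield field, sample

# for field, sample in candidate_pairs(fields, samples): run the q-gram / Jaro-Winkler match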
What is your programming language? I guess using q=2 or 3 is sufficient. Also, I suggest moving from unigrams to higher degrees.

Why is random not so random?

Can someone provide an explanation as to how modern programming languages (Java, C#, Python, JavaScript) cope with the limitations of randomness, and where those limitations (time-based seeds, for example) originate, i.e. whether they are imposed by the underlying operating systems and Intel-based hardware?
Basically, I'd like to understand why there is no such thing as a truly random number without the appropriate hardware.
I'm going to answer the second part of your question first:
Basically I'd like to understand why there is no such thing as a truly random number without the appropriate hardware.
You can't generate truly random numbers on a computer without special hardware because computers are deterministic machines. What this means is that, given some initial state and an operation to perform, you can predict exactly how the machine will evolve. For instance, if you know that, on some hypothetical architecture, that register %d0 contains 24 and register %d1 contains 42, and you know that the next instruction in the instruction stream is add %d0 %d1 %d2, you then know that, after that instruction is executed, %d2 will contain 66. In a higher-level language, you know that writing x = 1; y = 2; z = x + y will result in z being 3 with certainty.
This makes sense; we don't want to wonder what an addition will do, we want it to add. However, this is incompatible with generating truly random numbers. For a number to be truly random, there needs to be absolutely no way to predict it, no matter what you know. Certain quantum-mechanical processes have this behavior precisely, and other natural processes are close enough to random that, for all practical purposes, they are (for instance, if they look random and predicting them would require knowing the state of every molecule in the atmosphere). However, computers cannot do this, because the whole point of having a computer is to have a machine which deterministically executes code. You need to be able to predict what will happen when you run programs, else what's the point?
In a comment to Milan Ramaiya's answer, you said
I agree with [yo]u but still missing the most important thing - why cant computers produce a random number with pre-determined input?
The answer falls out directly from the definition of a truly random number. Since a truly random number needs to be completely unpredictable, it can never depend on deterministic input. If you have an algorithm which takes pre-determined input and uses it to produce a pseudo-random number, you can duplicate this process at will just as long as you know the input and algorithm.
You also asked
Can someone provide an explanation as to how modern programming languages … cope with the limitations of randomness and where those limitations … originate.
Well, as mentioned above, the limitations are inherent to the deterministic design of our languages and machines, which are there for good reasons (so that said languages and machines are usable :-) ). Assuming you aren't calling out to something which does have access to truly random numbers (such as /dev/random on systems where it exists), the approach taken is to use a pseudo-random number generator. These algorithms are designed to produce a statistically random output sequence—one which, in a formal sense, looks unpredictable. I don't know enough statistics to explain or understand the details of this, but I believe the idea is that there are certain numeric tests you can run to tell how well your data predicts itself (in some loose sense) and things like that. However, the important point is that, while the sequence is deterministic, it "looks random". For many purposes, this is enough! And sometimes it has advantages: if you want to test code, for instance, it can be nice to be able to specify a seed and always have it receive the same sequence of pseudo-random numbers.
In summary, the overall answer to your question is this: Because we want to be able to predict what computers do, they can't generate unpredictable numbers (without special hardware). Programming languages aren't generally too impacted by this, because pseudo-random number generators are sufficient for most cases.
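As a small illustration of the reproducibility point above (a sketch using Python's standard random module): reseeding with the same value replays exactly the same pseudo-random sequence.

import random

random.seed(42)                          # fixed seed
first = [random.random() for _ in range(3)]

random.seed(42)                          # same seed again
second = [random.random() for _ in range(3)]

print(first == second)                   # True: the sequence is deterministic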
Software by design is deterministic. So the way random numbers are typically generated is by using a formula that spits out data in a statistically random order. This way, any program that needs a uniform distribution of numbers can set a seed based on some physical data (e.g. a timestamp) and get what will look like a random set of numbers. However, given a specific set of inputs, software will always perform in the same manner.
To have true random, there needs to be input which is nondeterministic.
Quoting Wikipedia,
To generate truly random numbers requires precise, accurate, and repeatable system measurements of absolutely non-deterministic processes. The open source operating system Linux uses, for example, various system timings (like user keystrokes, I/O, or least-significant digit voltage measurements) to produce a pool of random numbers. It attempts to constantly replenish the pool, depending on the level of importance, and so will issue a random number. This system is an example, and similar to those of dedicated hardware random number generators.
Computers generate random numbers by taking them from a long list of pre-generated values. Using a seed value helps to create different results every time the program is run, but isn't a fix-all because the list is fixed - it only changes the start position within that list. Computers are, obviously, very rigid in how they do things, in that they can't do something truly random due to the limitations of how they are made. Sites like random.org create random numbers from external sources like radio noise. Maybe computers should take the noise from the power supply and use that as a truly random base? :-P
Systems are designed to be predictable and discrete; nobody wants chaotic computers, because people need to be able to program them.
Predictable systems can't produce truly random numbers, only predictable numbers.
Software random number generation has two basic steps:
- generate a pseudo-random number
- manipulate this pseudo-random number to obtain a number in a more useful range (0 to 1, 1 to 100, etc.)
A common problem with software random number generators is that they always have cycles. These cycles consist of a fixed set of numbers (the algorithm can't generate other numbers). If the algorithm is good, that cycle covers a very, very big set of numbers; but if the algorithm is bad, the set of numbers may be insufficient.
These generated numbers are then processed to obtain numbers only in 1 to 100 or 0 to 1 (for example) so that they are useful to your program. Since the original algorithm isn't able to generate every number in the range, the resulting set will contain some numbers more often than others.
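A toy sketch of both steps (the linear congruential generator constants below are the classic C library textbook values, used here only as an example): the raw generator is fully determined by the seed and eventually cycles, and the modulo step that squashes it into 1 to 100 can favour some values slightly.

def lcg(seed, a=1103515245, c=12345, m=2**31):
    # Step 1: a linear congruential generator; the sequence of states
    # is fully determined by the seed and eventually repeats (a cycle).
    state = seed
    while True:
        state = (a * state + c) % m
        yield state

gen = lcg(seed=1)
# Step 2: squash the raw output into a useful range (1 to 100 here).
# Because m is not a multiple of 100, the mapping is not perfectly uniform.
print([next(gen) % 100 + 1 for _ in range(10)])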

How do search engines conduct 'AND' operation?

Consider the following search results:
Google for 'David' - 591 million hits in 0.28 sec
Google for 'John' - 785 million hits in 0.18 sec
OK. Pages are indexed, it only needs to look up the count and the first few items in the index table, so speed is understandable.
Now consider the following search with AND operation:
Google for 'David John' ('David' AND 'John') - 173 million hits in 0.25 sec
This makes me ticked ;) How on earth can search engines get the result of AND operations on gigantic datasets so fast? I see the following two ways to conduct the task and both are terrible:
You conduct the search of 'David'. Take the gigantic temp table and conduct a search of 'John' on it. HOWEVER, the temp table is not indexed by 'John', so brute force search is needed. That just won't compute within 0.25 sec no matter what HW you have.
Indexing by all possible word combinations like 'David John'. Then we face a combinatorial explosion in the number of keys, and not even Google has the storage capacity to handle that.
And you can AND together as many search phrases as you want and still get answers in under 0.5 sec! How?
What Markus wrote about Google processing the query on many machines in parallel is correct.
In addition, there are information retrieval algorithms that make this job a little bit easier. The classic way to do it is to build an inverted index which consists of postings lists - a list for each term of all the documents that contain that term, in order.
When a query with two terms is searched, conceptually, you would take the postings lists for each of the two terms ('david' and 'john'), and walk along them, looking for documents that are in both lists. If both lists are ordered the same way, this can be done in O(N). Granted, N is still huge, which is why this will be done on hundreds of machines in parallel.
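A small sketch of that postings-list walk (the document ids and the function name are made up): with both lists sorted by document id, a single lockstep pass finds the intersection in O(len(a) + len(b)).

def intersect_postings(a, b):
    # Both postings lists must be sorted by document id.
    i = j = 0
    hits = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            hits.append(a[i])        # document contains both terms
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return hits

david = [2, 5, 9, 14, 21]            # doc ids containing 'david'
john = [3, 5, 8, 14, 30]             # doc ids containing 'john'
print(intersect_postings(david, john))   # [5, 14]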
Also, there may be additional tricks. For example, if the highest-ranked documents were placed higher on the lists, then maybe the algorithm could decide that it found the 10 best results without walking the entire lists. It would then guess at the remaining number of results (based on the size of the two lists).
I think you're approaching the problem from the wrong angle.
Google doesn't have tables/indices on a single machine. Instead they partition their dataset heavily across their servers. Reports indicate that as many as 1000 physical machines are involved in every single query!
With that amount of computing power it's "simply" (used highly ironically) a matter of ensuring that every machine completes their work in fractions of a second.
Reading about Google technology and infrastructure is very inspiring and highly educational. I'd recommend reading up on BigTable, MapReduce and the Google File System.
Google has an archive of their publications available with lots of juicy information about their technologies. This thread on MetaFilter also provides some insight into the enormous amount of hardware needed to run a search engine.
I don't know how google does it, but I can tell you how I did it when a client needed something similar:
It starts with an inverted index, as described by Avi. That's just a table listing, for every word in every document, the document id, the word, and a score for the word's relevance in that document. (Another approach is to index each appearance of the word individually along with its position, but that wasn't required in this case.)
From there, it's even simpler than Avi's description - there's no need to do a separate search for each term. Standard database summary operations can easily do that in a single pass:
SELECT document_id, sum(score) total_score, count(score) matches FROM rev_index
WHERE word IN ('david', 'john') GROUP BY document_id HAVING matches = 2
ORDER BY total_score DESC
This will return the IDs of all documents which have scores for both 'David' and 'John' (i.e., both words appear), ordered by some approximation of relevance. It will take about the same time to execute regardless of how many or how few terms you're looking for, since IN performance is not affected much by the size of the target set and a simple count determines whether all terms were matched or not.
Note that this simplistic method just adds the 'David' score and the 'John' score together to determine overall relevance; it doesn't take the order/proximity/etc. of the names into account. Once again, I'm sure that google does factor that into their scores, but my client didn't need it.
I did something similar to this years ago on a 16-bit machine. The dataset had an upper limit of around 110,000 records (it was a cemetery, so a finite limit on burials), so I set up a series of bitmaps, each containing 128K bits.
The search for "david" resulted in me setting the relevant bit in one of the bitmaps to signify that the record had the word "david" in it. I did the same for 'john' in a second bitmap.
Then all you need to do is a binary 'and' of the two bitmaps, and the resulting bitmap tells you which record numbers had both 'david' and 'john' in them. Quick scan of the resulting bitmap gives you back the list of records that match both terms.
This technique wouldn't work for google though, so consider this my $0.02 worth.
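A minimal Python rendition of that bitmap trick (the record ids and sizes are made up; a Python int serves as an arbitrarily long bitset here):

def build_bitmap(matching_record_ids):
    # One bit per record: bit i is set if record i matched the term.
    bits = 0
    for rec_id in matching_record_ids:
        bits |= 1 << rec_id
    return bits

NUM_RECORDS = 110_000
david = build_bitmap([2, 7, 42, 99_000])
john = build_bitmap([7, 13, 42, 99_000])

both = david & john                        # binary AND of the two bitmaps
matches = [i for i in range(NUM_RECORDS) if (both >> i) & 1]
print(matches)                             # [7, 42, 99000]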

Resources