Is there a data-structure designed specifically for fast insertion and search of multidimensional coordinates (many more than 2 or 3d, for all practical purposes say less than 1k dimensions and 1M points)? Even better, for arbitrary distance metrics?
I know about kd-trees, which are good for insertion, but as far as I know, balancing them is non-trivial, and search is not very efficient in higher dimensions. Unordered maps / hash tables would be a good solution at first glance, but as far as I know there are issues with hashing and collisions (eg converting to a string often truncates the numerical precision, and dealing with collisions of non-neighbouring points can be expensive). Maybe something like a red-black tree on each dimension would be good for insertion, and not too bad for search (recursively filtering along dimensions).
I just don't want to reinvent the wheel and I am sure this is a common need in data sciences these days. Happy to take links to papers / tutorials as an answer. Ideally the answer would have an existing implementation in C / C++ / Python / Java / Matlab.
The data structure you're looking for is R-Tree.
You can find Java implementation here.
https://github.com/davidmoten/rtree
Related
I have a list of US names and their respective names from the US census website. I would like to generate a random name from this list using the given probability. The data is here: US Census data
I have seen algorithms like the roulette wheel selection algorithm that are easy to implement, but I wanted to know if there was any way to generate random names in O(1). For histogram data this is easier, as you could create a hash of integers to birthdays, but I would like to do this for a continuous distribution.
If this is not possible, are there any python modules that take in probability distributions and generate random values based on those distributions?
There is an O(1)-time method See this detailed description of Vose's "alias" method. Unfortunately, it suffers from high initialization cost. For comparative timings of simpler methods, see Eli Bendersky's blog post. More timings can be found in this from the Python issue tracker.
These days it's practical to enumerate the entire US population (~317 million) if you really need O(1) lookup. Just pick a number up to 317 million and get the name from there. (317000000*4 bytes = 1.268GB)
I think there are lots of O(log n) ways. Is there a particular reason you need O(1) (They will use a lot less memory)
I'm trying to create my own equalizer. I want to implement 10 IIR bandpass filters. I know the equations to calculate those but I read that for higher center frequencies (above 6000Hz) they should be calculated differently. Of course I have no idea how (and why). Or maybe it's all lies and I don't need other coefficients?
Source: http://cache.freescale.com/files/dsp/doc/app_note/AN2110.pdf
You didn't read closely enough; the application note says "f_s/8 (or 6000Hz)", because for the purpose of where that's written, the sampling rate is 48000Hz.
However, that is a very narrowing look at filters; drawing the angles involved in equations 4,5,6 from that applications notes into an s-plane diagram this looks like it'd make sense, but these aren't the only filter options there are. The point the AN makes is that these are simple formulas, that approximate a "good" filter (because designing an IIR is usually a bit more complex), and they can only be used below f_2/8. I haven't tried to figure out what happens mathematically at higher frequencies, but I just guess the filters aren't as nicely uniform afterwards.
So, my approach would simply be using any filter design tool to calculate coefficients for you. You could, for example, use Matlab's filter design tool, or you could use GNU Radio's gr_filter_design, to give you IIRs. However, automatically found IIRs will usually be longer than 3 taps, unless you know very well how you mathematically define your design requirements so that the algorithm does what you want.
As much as I like the approach of using IIRs for audio equalization, where phase doesn't matter, I'd say the approach in the Application node is not easy to understand unless one has very solid background in filter/system theory. I guess you'd either study some signal theory with an electrical engineering textbook, or you just accept the coefficients as they are given on p. 28ff.
In the context of a large-scale data mining benchmarking study, I am comparing 15 algorithms over 9 data sets, leading to an overall 135 algorithm/dataset combinations. The study is done using WEKA.
My last analysis is concerned with the influence of feature selection. I am aware, that there is no such thing as the perfect feature selection algorithm but the optimal choice rather depends on both algorithm to be deployed and the data set to which it will be applied.
Although the problem is to large to find the optimal feature selection algorithm for each combination, I am looking for ones that are considered to show a good performance in general, 'allrounder' so to say.
So far I have found recommendation for CFS (Correlation-based feature selection), ReliefF and Consistency-based subset evaluation (Hall / Holmes 2002) as a generally good choice as well as the note from a survey, that methods as simple as Rankers (e.g. Correlation coefficient) proved quiet effective (Guyon / Ellissef 2003).
Is there a good benchmark study some other research indicating which methods to use or which ones to use in practice?
From a Text Classification point of view, there is one article by Yang etal. comparing different feature selection algorithms (chi square, document frequency and Information Gain).
Although it is focus on text (i.e., the document frequency won't apply to you at all) the others might, depending on the nature of your features (i.e., binary or not, always present, ...)
I hope this helps.
I have a python app with a database of businesses and I want to be able to search for businesses by name (for autocomplete purposes).
For example, consider the names "best buy", "mcdonalds", "sony" and "apple".
I would like "app" to return "apple", as well as "appel" and "ple".
"Mc'donalds" should return "mcdonalds".
"bst b" and "best-buy" should both return "best buy".
Which algorithm am I looking for, and does it have a python implementation?
Thanks!
The Levenshtein distance should do.
Look around - there are implementations in many languages.
Levenshtein distance will do this.
Note: this is a distance, you have to calculate it to every string in your database, which can be a big problem if you have a lot of entries.
If you have this problem then record all the typos the users make (typo=no direct match) and offline build a correction database which contains all the typo->fix mappings. some companies do this even more clever, eg: google watches how users correct their own typos and learns the mappings from this.
Soundex or Metaphone might work.
I think what you are looking for is a huge field of Data Quality and Data Cleansing. I fear if you could find a python implementation regarding this as it has to be something which cleanses considerable amount of data in db which could be of business value.
Levensthein distance goes in the right direction but only half the way. There are several tricks to get it to use the half matches as well.
One would be to use a subsequence dynamic time warping (DTW is actually a generalization of levensthein distance). For this you relax the start and end cases when calcualting the cost matrix. If you only relax one of the conditions you can get autocompletion with spell checking. I am not sure if there is a python implementation available, but if you want to implement it for yourself it should not be more than 10-20 LOC.
The other idea would be to use a Trie for speed up, which can do DTW/Levensthein on multiple results simultaniously (huge speedup if your database is large). There is a paper on Levensthein on Tries at IEEE, so you can find the algorithm there. Again for this you would need to relax the final boundary condition, so you get partial matches. However since you step down in the trie you just need to check when you have fully consumed the input and then return all leaves.
check this one http://docs.python.org/library/difflib.html
it should help you
Pretty common situation, I'd wager. You have a blog or news site and you have plenty of articles or blags or whatever you call them, and you want to, at the bottom of each, suggest others that seem to be related.
Let's assume very little metadata about each item. That is, no tags, categories. Treat as one big blob of text, including the title and author name.
How do you go about finding the possibly related documents?
I'm rather interested in the actual algorithm, not ready solutions, although I'd be ok with taking a look at something implemented in ruby or python, or relying on mysql or pgsql.
edit: the current answer is pretty good but I'd like to see more. Maybe some really bare example code for a thing or two.
This is a pretty big topic -- in addition to the answers people come up with here, I recommend tracking down the syllabi for a couple of information retrieval classes and checking out the textbooks and papers assigned for them. That said, here's a brief overview from my own grad-school days:
The simplest approach is called a bag of words. Each document is reduced to a sparse vector of {word: wordcount} pairs, and you can throw a NaiveBayes (or some other) classifier at the set of vectors that represents your set of documents, or compute similarity scores between each bag and every other bag (this is called k-nearest-neighbour classification). KNN is fast for lookup, but requires O(n^2) storage for the score matrix; however, for a blog, n isn't very large. For something the size of a large newspaper, KNN rapidly becomes impractical, so an on-the-fly classification algorithm is sometimes better. In that case, you might consider a ranking support vector machine. SVMs are neat because they don't constrain you to linear similarity measures, and are still quite fast.
Stemming is a common preprocessing step for bag-of-words techniques; this involves reducing morphologically related words, such as "cat" and "cats", "Bob" and "Bob's", or "similar" and "similarly", down to their roots before computing the bag of words. There are a bunch of different stemming algorithms out there; the Wikipedia page has links to several implementations.
If bag-of-words similarity isn't good enough, you can abstract it up a layer to bag-of-N-grams similarity, where you create the vector that represents a document based on pairs or triples of words. (You can use 4-tuples or even larger tuples, but in practice this doesn't help much.) This has the disadvantage of producing much larger vectors, and classification will accordingly take more work, but the matches you get will be much closer syntactically. OTOH, you probably don't need this for semantic similarity; it's better for stuff like plagiarism detection. Chunking, or reducing a document down to lightweight parse trees, can also be used (there are classification algorithms for trees), but this is more useful for things like the authorship problem ("given a document of unknown origin, who wrote it?").
Perhaps more useful for your use case is concept mining, which involves mapping words to concepts (using a thesaurus such as WordNet), then classifying documents based on similarity between concepts used. This often ends up being more efficient than word-based similarity classification, since the mapping from words to concepts is reductive, but the preprocessing step can be rather time-consuming.
Finally, there's discourse parsing, which involves parsing documents for their semantic structure; you can run similarity classifiers on discourse trees the same way you can on chunked documents.
These pretty much all involve generating metadata from unstructured text; doing direct comparisons between raw blocks of text is intractable, so people preprocess documents into metadata first.
You should read the book "Programming Collective Intelligence: Building Smart Web 2.0 Applications" (ISBN 0596529325)!
For some method and code: First ask yourself, whether you want to find direct similarities based on word matches, or whether you want to show similar articles that may not directly relate to the current one, but belong to the same cluster of articles.
See Cluster analysis / Partitional clustering.
A very simple (but theoretical and slow) method for finding direct similarities would be:
Preprocess:
Store flat word list per article (do not remove duplicate words).
"Cross join" the articles: count number of words in article A that match same words in article B. You now have a matrix int word_matches[narticles][narticles] (you should not store it like that, similarity of A->B is same as B->A, so a sparse matrix saves almost half the space).
Normalize the word_matches counts to range 0..1! (find max count, then divide any count by this) - you should store floats there, not ints ;)
Find similar articles:
select the X articles with highest matches from word_matches
This is a typical case of Document Classification which is studied in every class of Machine Learning. If you like statistics, mathematics and computer science, I recommend that you have a look at the unsupervised methods like kmeans++, Bayesian methods and LDA. In particular, Bayesian methods are pretty good at what are you looking for, their only problem is being slow (but unless you run a very large site, that shouldn't bother you much).
On a more practical and less theoretical approach, I recommend that you have a look a this and this other great code examples.
A small vector-space-model search engine in Ruby. The basic idea is that two documents are related if they contain the same words. So we count the occurrence of words in each document and then compute the cosine between these vectors (each terms has a fixed index, if it appears there is a 1 at that index, if not a zero). Cosine will be 1.0 if two documents have all terms common, and 0.0 if they have no common terms. You can directly translate that to % values.
terms = Hash.new{|h,k|h[k]=h.size}
docs = DATA.collect { |line|
name = line.match(/^\d+/)
words = line.downcase.scan(/[a-z]+/)
vector = []
words.each { |word| vector[terms[word]] = 1 }
{:name=>name,:vector=>vector}
}
current = docs.first # or any other
docs.sort_by { |doc|
# assume we have defined cosine on arrays
doc[:vector].cosine(current[:vector])
}
related = docs[1..5].collect{|doc|doc[:name]}
puts related
__END__
0 Human machine interface for Lab ABC computer applications
1 A survey of user opinion of computer system response time
2 The EPS user interface management system
3 System and human system engineering testing of EPS
4 Relation of user-perceived response time to error measurement
5 The generation of random, binary, unordered trees
6 The intersection graph of paths in trees
7 Graph minors IV: Widths of trees and well-quasi-ordering
8 Graph minors: A survey
the definition of Array#cosine is left as an exercise to the reader (should deal with nil values and different lengths, but well for that we got Array#zip right?)
BTW, the example documents are taken from the SVD paper by Deerwester etal :)
Some time ago I implemented something similiar. Maybe this idea is now outdated, but I hope it can help.
I ran a ASP 3.0 website for programming common tasks and started from this principle: user have a doubt and will stay on website as long he/she can find interesting content on that subject.
When an user arrived, I started an ASP 3.0 Session object and recorded all user navigation, just like a linked list. At Session.OnEnd event, I take first link, look for next link and incremented a counter column like:
<Article Title="Cookie problem A">
<NextPage Title="Cookie problem B" Count="5" />
<NextPage Title="Cookie problem C" Count="2" />
</Article>
So, to check related articles I just had to list top n NextPage entities, ordered by counter column descending.