Searching for closest statistically significant match in k-dimensional set - search

At a very high level this is similar to the nearest neighbor search problem.
From wiki: "given a set S of points in a space M and a query point q ∈ M, find the closest point in S to q".
But some significant differences. Specifics:
Each point is described by k variables.
The variables are not all numerical. Mixed data types:
string, int etc.
All possible values for all variables not known - but they come from reasonably small sets.
In the data set to search from there will be multiple points with same values for all the k variables.
Another way to look at this is there will be many duplicate points.
For each point lets call the number of duplicates as frequency.
Given a query point q need to find nearest neighbor p such that frequency of p is at-least 15
There seems to be a wide range of of algorithms around NNS and statistical classification and best bin match.
I am getting a little lost in all the variations. Is there already a standard algorithm I can use. Or would I need to modify one?

Related

Efficient algorithm for nearest point search in a grid

I am looking for an algorithm that can do efficient search in a grid.
I have a large array which includes all the centroid points (x,y,z)
Now for a given location (xp,yp,zp) I want to find the closest centroid to that p location.
Currently I am doing a brute force search which basically for each point p I go through all points, calculate the distance to location p and by this find out which centroid that is.
I know that octree search and kd-tree might help but not too sure how to tackle it or which one would be better.
I would you a spatial index, such as the kd-tree or quadtree/octree (which you suggested) or maybe an R-Tree based solution.
Put all your centroids into the index. Usually you can associated any point in the index with some additional data, so if you need that, you could provide a back-treference into the grids, for example grid coordinates).
Finding the nearest point in the index should be very fast. The returned data then allows you to go back into the grid.
In a way, a quadtree/octree is in itself nothing but a discretizing grid that get finer if the point density increases. The difference to a grid is that it is hierarchical and that empty areas are not stored at all.

Lua - Finding the best match for a string

I was curious if anyone had a good method of choosing the best matching case between strings. For example, say I have a table with keys “Hi there”, “Hello”, “Hiya”, “hi”, “Hi”, and “Hey there”. The I want to find the closest match for “Hi”. It would then match to the “Hi” first. If that wasn’t found, then the “hi” then “Hiya”, and so on. Prioritizing perfect matches, then lower/uppercase matches, then which ever had the least number of differences or length difference.
My current method seems unwieldy, first checking for a perfect match, then looping around with a string.match, saving any with the closest string.len.
If you're not looking for a perfect match only, you need to use some metric as a measure of similarity and then look for the closest match.
As McBarby suggested in his comment you can use the Levenshtein distance which is the minimum number of single character edits necessary to get from string 1 to string 2. Just research which metrics are available and which one suits your needs best. Of course you can also define your own metric.
https://en.wikipedia.org/wiki/String_metric lists a number of other string metrics:
Sørensen–Dice coefficient
Block distance or L1 distance or City block distance
Jaro–Winkler distance
Simple matching coefficient (SMC)
Jaccard similarity or Jaccard coefficient or Tanimoto coefficient
Tversky index
Overlap coefficient
Variational distance
Hellinger distance or Bhattacharyya distance
Information radius (Jensen–Shannon divergence)
Skew divergence
Confusion probability
Tau metric, an approximation of the Kullback–Leibler divergence
Fellegi and Sunters metric (SFS)
Maximal matches
Grammar-based distance
TFIDF distance metric

Averaging many curves with different x and y values

I have several curves that contain many data points. The x-axis is time and let's say I have n curves with data points corresponding to times on the x-axis.
Is there a way to get an "average" of the n curves, despite the fact that the data points are located at different x-points?
I was thinking maybe something like using a histogram to bin the values, but I am not sure which code to start with that could accomplish something like this.
Can Excel or MATLAB do this?
I would also like to plot the standard deviation of the averaged curve.
One concern is: The distribution amongst the x-values is not uniform. There are many more values closer to t=0, but at t=5 (for example), the frequency of data points is much less.
Another concern. What happens if two values fall within 1 bin? I assume I would need the average of these values before calculating the averaged curve.
I hope this conveys what I would like to do.
Any ideas on what code I could use (MATLAB, EXCEL etc) to accomplish my goal?
Since your series' are not uniformly distributed, interpolating prior to computing the mean is one way to avoid biasing towards times where you have more frequent samples. Note that by definition, interpolation will likely reduce the range of your values, i.e. the interpolated points aren't likely to fall exactly at the times of your measured points. This has a greater effect on the extreme statistics (e.g. 5th and 95th percentiles) rather than the mean. If you plan on going this route, you'll need the interp1 and mean functions
An alternative is to do a weighted mean. This way you avoid truncating the range of your measured values. Assuming x is a vector of measured values and t is a vector of measurement times in seconds from some reference time then you can compute the weighted mean by:
timeStep = diff(t);
weightedMean = timeStep .* x(1:end-1) / sum(timeStep);
As mentioned in the comments above, a sample of your data would help a lot in suggesting the appropriate method for calculating the "average".

Is there a standard metric for sorted text?

Given a range of numbers, say from [80,240], it is easy to determine how much of that range lies within [100,105]: (105-100)/(240-80) = 5/160 = .03125. Easy.
So now, how much of a Meriam Webster dictionary lies between umbrella and velvet? Even if we assume uniform distribution of text across the corpus, is there a standard metric for text?
I don't think there is a standard for that. If you had all entries from Meriam Webster in an array, you could use first and last positions as the bounds, so you would have a set going from 1 to n. Then you could pick the positions of "umbrella" and "velvet", call them x and y, and calculate your range as (y - x + 1) / (n).
That works if you are seeing words as elements of an ordered set, so as to have them behave as real numbers. You are basically dividing the distance between two numbers in a set by the distance between the boundaries of the set. Some forms of algebra deal with them differently - when calculating the Levenshtein distance between any two given words, for example, each words is seen as a vector with as many dimensions as they have characters.
You could define the boundaries of your n-dimensional space by using the biggest word in Meriam Webster (hint: it's "pneumonoultramicroscopicsilicovolcanoconiosis", so your space would have 45 dimensions). However, when considering any A-B pair of words, a third word C of intermediary length may or may not be between those, depending on the operations involved in the transformation from A to B.
You'd have to check every word with a length between that of A and B to check whether they are part of the range between A and B... So it's not a matter of simple calculus, and I don't know if this could be even feasible with a regular computer nowadays. And that's just considering Meriam's close to half a million entries.

How does clustering (especially String clustering) work?

I heard about clustering to group similar data. I want to know how it works in the specific case for String.
I have a table with more than different 100,000 words.
I want to identify the same word with some differences (eg.: house, house!!, hooouse, HoUse, #house, "house", etc...).
What is needed to identify the similarity and group each word in a cluster? What algorithm is more recommended for this?
To understand what clustering is imagine a geographical map. You can see many distinct objects (such as houses). Some of them are close to each other, and others are far. Based on this, you can split all objects into groups (such as cities). Clustering algorithms make exactly this thing - they allow you to split your data into groups without previous specifying groups borders.
All clustering algorithms are based on the distance (or likelihood) between 2 objects. On geographical map it is normal distance between 2 houses, in multidimensional space it may be Euclidean distance (in fact, distance between 2 houses on the map also is Euclidean distance). For string comparison you have to use something different. 2 good choices here are Hamming and Levenshtein distance. In your particular case Levenshtein distance if more preferable (Hamming distance works only with the strings of same size).
Now you can use one of existing clustering algorithms. There's plenty of them, but not all can fit your needs. For example, pure k-means, already mentioned here will hardly help you since it requires initial number of groups to find, and with large dictionary of strings it may be 100, 200, 500, 10000 - you just don't know the number. So other algorithms may be more appropriate.
One of them is expectation maximization algorithm. Its advantage is that it can find number of clusters automatically. However, in practice often it gives less precise results than other algorithms, so it is normal to use k-means on top of EM, that is, first find number of clusters and their centers with EM and then use k-means to adjust the result.
Another possible branch of algorithms, that may be suitable for your task, is hierarchical clustering. The result of cluster analysis in this case in not a set of independent groups, but rather tree (hierarchy), where several smaller clusters are grouped into one bigger, and all clusters are finally part of one big cluster. In your case it means that all words are similar to each other up to some degree.
There is a package called stringdist that allows for string comparison using several different methods. Copypasting from that page:
Hamming distance: Number of positions with same symbol in both strings. Only defined for strings of equal length.
Levenshtein distance: Minimal number of insertions, deletions and replacements needed for transforming string a into string b.
(Full) Damerau-Levenshtein distance: Like Levenshtein distance, but transposition of adjacent symbols is allowed.
Optimal String Alignment / restricted Damerau-Levenshtein distance: Like (full) Damerau-Levenshtein distance but each substring may only be edited once.
Longest Common Substring distance: Minimum number of symbols that have to be removed in both strings until resulting substrings are identical.
q-gram distance: Sum of absolute differences between N-gram vectors of both strings.
Cosine distance: 1 minus the cosine similarity of both N-gram vectors.
Jaccard distance: 1 minues the quotient of shared N-grams and all observed N-grams.
Jaro distance: The Jaro distance is a formula of 4 values and effectively a special case of the Jaro-Winkler distance with p = 0.
Jaro-Winkler distance: This distance is a formula of 5 parameters determined by the two compared strings (A,B,m,t,l) and p chosen from [0, 0.25].
That will give you the distance. You might not need to perform a cluster analysis, perhaps sorting by the string distance itself is sufficient. I have created a script to provide the basic functionality here... feel free to improve it as needed.
You can use an algorithm like the Levenshtein distance for the distance calculation and k-means for clustering.
the Levenshtein distance is a string metric for measuring the amount of difference between two sequences
Do some testing and find a similarity threshold per word that will decide your groups.
You can use a clustering algorithm called "Affinity Propagation". This algorithm takes in an input called similarity matrix which you can generate by taking negative of the either Levenstein distance or an harmonic mean of partial_ratio and token_set_ratio from fuzzywuzzy library if you are using Python.

Resources