Compare two plotted lines with x,y coordinates (machine learning / geometry)

So I have multiple paths stored, each path consisting of data points x1,y1 | x2,y2 | x3,y3 ... etc.
I would like to compare these paths with one another to work out whether any similarities are present.
I could run through each point and see if it matches any of the points in the first path, then check whether the next point matches the next point.
I think this would work if there were no anomalies, but it could skip over a match if the next point did not line up.
I would also like to build in some level of tolerance, e.g. 10,10 may match 12,12 or 8,8.
Is this a good way to compare the data, or is there a better approach?
As a second step I may want to consider time as well, so each point would have a time value associated with it.
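A minimal sketch (in Python, which is an assumption about your environment) of the point-by-point comparison with a tolerance described above; the tolerance value and the path format are placeholders:

import math

def paths_match(path_a, path_b, tol=3.0):
    # Compare two equal-length paths point by point; points within tol units count as a match.
    if len(path_a) != len(path_b):
        return False
    for (xa, ya), (xb, yb) in zip(path_a, path_b):
        if math.hypot(xa - xb, ya - yb) > tol:
            return False
    return True

print(paths_match([(10, 10), (20, 20)], [(12, 12), (19, 21)]))  # True: 10,10 matches 12,12 within tolerance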

Some possible approaches you can use:
handle both paths as polygons and compare them as such
see: How to compare two shapes?
use OCR algorithms/approaches
see: OCR and character similarity
transform both paths into synchronized datasets and correlate
either extract significant points only and/or resample the paths to the same point count. Then synchronize both datasets (as in bullet 1) and use a correlation coefficient (see the sketch at the end of these notes)
[notes]
Depending on the input data you can also exploit DCT/DFT transforms to remove unimportant data (like in JPG compression), and/or compare in the frequency domain instead of the spatial/time domain.
You can also compare obvious properties (invariant to rotation and translation) like:
area
perimeter length
number of self-intersections
number of inflection points
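A minimal sketch of the "resample to the same point count, then correlate" idea from the list above (the point count and the use of Pearson correlation are assumptions):

import numpy as np

def resample_path(path, n=100):
    # Resample a path (sequence of (x, y) points) to n points evenly spaced by arc length.
    pts = np.asarray(path, dtype=float)
    seg = np.linalg.norm(np.diff(pts, axis=0), axis=1)
    s = np.concatenate(([0.0], np.cumsum(seg)))   # cumulative arc length
    t = np.linspace(0.0, s[-1], n)
    return np.column_stack([np.interp(t, s, pts[:, 0]), np.interp(t, s, pts[:, 1])])

def path_similarity(path_a, path_b, n=100):
    a, b = resample_path(path_a, n), resample_path(path_b, n)
    # Correlate the x and y series separately and average the two coefficients.
    rx = np.corrcoef(a[:, 0], b[:, 0])[0, 1]
    ry = np.corrcoef(a[:, 1], b[:, 1])[0, 1]
    return (rx + ry) / 2.0

print(path_similarity([(0, 0), (1, 1), (2, 2)], [(0, 0.2), (1, 0.9), (2, 2.1)]))  # close to 1 for similar shapes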

You could compare the means and variances of the two sets of points. If they lie on straight lines, as you hypothesize, you could fit straight lines through the two datasets and then compare the parameters of the two fitted lines to infer how far apart they are. It would be more helpful if you could describe the behaviour of the two datasets.
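A minimal sketch of the line-fitting comparison suggested above (it assumes each dataset really is roughly linear; the example data is made up):

import numpy as np

def line_params(path):
    # Least-squares fit of y = slope * x + intercept through the path's points.
    pts = np.asarray(path, dtype=float)
    return np.polyfit(pts[:, 0], pts[:, 1], 1)   # [slope, intercept]

a = [(0, 1.0), (1, 3.1), (2, 4.9), (3, 7.2)]
b = [(0, 0.5), (1, 2.4), (2, 4.6), (3, 6.4)]
slope_a, icept_a = line_params(a)
slope_b, icept_b = line_params(b)
print(abs(slope_a - slope_b), abs(icept_a - icept_b))   # small differences -> similar lines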

Related

Excel - How to find intersection point of two lines on a graph

I have a graph which has two lines.
The graph is generated from "random" data, i.e. not based on a formula or pattern. But there is always a point where the two lines intersect.
I'm trying to find the exact point (on the x and y axes) where the lines cross.
I've tried using slope/intercept formulas and what-if analysis.
However, these methods only seem to work if the data is based on a formula or pattern.
I can sort the data and find the point where the two lines are at their closest, then take an average using data around that point to get an approximate match.
However, is there any way to do this more accurately, or does the nature of my data (random data points) make this impossible with formulas/equations?
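One numerical approach, sketched here in Python (it assumes both lines are sampled at the same x-values, which matches a two-column chart layout): find the interval where the difference of the two series changes sign, then linearly interpolate inside it. The same arithmetic can be reproduced with Excel formulas once the interval is located.

import numpy as np

def find_crossing(x, y1, y2):
    # First point where the piecewise-linear curves y1(x) and y2(x) cross.
    x = np.asarray(x, dtype=float)
    d = np.asarray(y1, dtype=float) - np.asarray(y2, dtype=float)
    idx = np.where(np.diff(np.sign(d)) != 0)[0]
    if idx.size == 0:
        return None                       # the curves never cross
    i = idx[0]
    t = d[i] / (d[i] - d[i + 1])          # fraction of the way through interval i
    return x[i] + t * (x[i + 1] - x[i]), y1[i] + t * (y1[i + 1] - y1[i])

print(find_crossing([0, 1, 2, 3], [0, 1, 2, 3], [3, 2.5, 1, 0]))   # (1.6, 1.6)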

Generate positive only distribution based on array

I have an array of data, for example:
[1000,800,700,650,630,500,370,350,310,250,210,180,150,100,80,50,30,20,15,12,10,8,6,3]
From this data, I want to generate random numbers that fit the same distribution.
I can generate a random number using code like the following:
import numpy as np
import scipy.stats

# data is the array listed above
dist = scipy.stats.gaussian_kde(data)
randomVar = np.floor(dist.resample()[0])
This results in random number generation that includes negative numbers, which I believe I can dump fairly easily without changing the overall shape of the rest of the curve (I just generate sufficient resamples that I still have enough for purpose after dumping the negatives).
However, because the original data was positive only, and heaped up against that boundary, I end up with a KDE that is highest a short distance above zero but then drops off sharply as it approaches zero; that downward tick in the KDE is preventing me from generating appropriate numbers.
I can set the bandwidth lower to get a sharper corner closer to zero, but then, due to the small amount of original data, it ends up sawtoothing elsewhere. Higher bandwidths unfortunately hide the shape of the curve before they remove the downward tick.
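For reference, a minimal sketch of the oversample-and-discard workaround described above (the target sample count is an assumption, and it still suffers from the boundary problem just described):

import numpy as np
import scipy.stats

data = [1000,800,700,650,630,500,370,350,310,250,210,180,150,100,80,50,30,20,15,12,10,8,6,3]
dist = scipy.stats.gaussian_kde(data)

def sample_positive(dist, n):
    # Keep resampling from the KDE and discarding non-positive draws until n samples remain.
    out = np.empty(0)
    while out.size < n:
        draw = np.floor(dist.resample(2 * n)[0])
        out = np.concatenate([out, draw[draw > 0]])
    return out[:n]

samples = sample_positive(dist, 1000)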
As broadly suggested in the comments by Hilbert's Drinking Problem, the real solution was to find a better distribution that fit the data. In my case that was chi-squared, which matched both the shape of the curve and the fact that it only takes positive values.
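The fitting code isn't shown in the question; a minimal sketch of fitting a chi-squared distribution with SciPy and sampling from it might look like this (pinning loc to zero is an assumption that keeps the support non-negative):

import numpy as np
import scipy.stats

data = [1000,800,700,650,630,500,370,350,310,250,210,180,150,100,80,50,30,20,15,12,10,8,6,3]
df, loc, scale = scipy.stats.chi2.fit(data, floc=0)                     # floc=0 pins the lower bound at zero
samples = np.floor(scipy.stats.chi2.rvs(df, loc=loc, scale=scale, size=1000))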
However, in the comments Stelios made the good suggestion of using scipy.stats.rv_histogram, which I used and was satisfied with for a while. This enabled me to fit a curve to the data exactly, though it had two problems:
1) It assumes zero value in the absence of data, i.e. if you set the settings to fit too closely to the data, then during gaps in your data it will drop to zero rather than interpolate.
2) As an extension to point 1, it won't extrapolate beyond the seed data's maximum and minimum (those data ranges are effectively giant gaps, so everything eventually zeroes out).
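For completeness, a minimal sketch of the rv_histogram approach (the bin count is an assumption):

import numpy as np
import scipy.stats

data = [1000,800,700,650,630,500,370,350,310,250,210,180,150,100,80,50,30,20,15,12,10,8,6,3]
hist = np.histogram(data, bins=10)            # counts and bin edges
dist = scipy.stats.rv_histogram(hist)
samples = np.floor(dist.rvs(size=1000))       # samples stay within [min(data), max(data)]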

Excel - 3D cartesian points - euclidean distance for a large group of points

I have a large set of XYZ Cartesian points in Excel (some 40k actually) and was looking for a formula or macro to compare every point to every other point to get the distances between them.
The math to get the distance value between two 3D points is:
Distance = SQRT((X2 - X1)^2 + (Y2 - Y1)^2 + (Z2 - Z1)^2)
X1=the X value of the 1st point
X2=the X value of the 2nd point
Y1=the Y value of the 1st point
Y2=the Y value of the 2nd point
etc
Here is an example starting with 10 points:
http://i.imgur.com/U3lchMk.jpg
Would anyone know of a way to build this into Excel so that I can just copy the formula across the page to the horizontal limit? Or would you recommend a better way than using Excel?
As a secondary goal, I want to group the points into clusters that can connect by a distance lower than 2. But if I can accomplish the first goal, I can worry about the second later.
Actually, I was able to come up with the solution with a bit more research: i.imgur.com/9JL5Qni.jpg
=SQRT(((INDIRECT("A"&$D2))-(INDIRECT("A"&E$1)))^2+((INDIRECT("B"&$D2))-(INDIRECT("B"&E$1)))^2+((INDIRECT("C"&$D2))-(INDIRECT("C"&E$1)))^2)
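Since the question also asks whether something other than Excel would be better: a minimal Python/SciPy sketch that computes the pairwise distance table and groups points that chain together through distances below 2 (the random placeholder data stands in for the spreadsheet's X, Y, Z columns):

import numpy as np
from scipy.spatial import cKDTree
from scipy.spatial.distance import cdist
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

rng = np.random.default_rng(0)
points = rng.uniform(0, 20, size=(500, 3))    # one row per point: X, Y, Z

# Full pairwise distance table (the SQRT formula above, vectorised). Fine for a few
# thousand points; at 40k points this is a 40000 x 40000 array, so use the tree below instead.
dist_table = cdist(points, points)

# Secondary goal: cluster points connected by distances < 2 (connected components of that graph).
tree = cKDTree(points)
pairs = np.array(list(tree.query_pairs(r=2.0)), dtype=int).reshape(-1, 2)
graph = csr_matrix((np.ones(len(pairs)), (pairs[:, 0], pairs[:, 1])), shape=(len(points), len(points)))
n_clusters, labels = connected_components(graph, directed=False)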

Averaging many curves with different x and y values

I have several curves that contain many data points. The x-axis is time and let's say I have n curves with data points corresponding to times on the x-axis.
Is there a way to get an "average" of the n curves, despite the fact that the data points are located at different x-points?
I was thinking maybe something like using a histogram to bin the values, but I am not sure which code to start with that could accomplish something like this.
Can Excel or MATLAB do this?
I would also like to plot the standard deviation of the averaged curve.
One concern: the distribution of the x-values is not uniform. There are many more values close to t=0, but at t=5 (for example) the frequency of data points is much lower.
Another concern: what happens if two values fall within one bin? I assume I would need to average those values before calculating the averaged curve.
I hope this conveys what I would like to do.
Any ideas on what code I could use (MATLAB, EXCEL etc) to accomplish my goal?
Since your series are not uniformly distributed, interpolating prior to computing the mean is one way to avoid biasing towards times where you have more frequent samples. Note that, by definition, interpolation will likely reduce the range of your values, i.e. the interpolated points aren't likely to fall exactly at the times of your measured points. This has a greater effect on the extreme statistics (e.g. the 5th and 95th percentiles) than on the mean. If you plan on going this route, you'll need the interp1 and mean functions.
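A minimal sketch of that interpolate-onto-a-common-grid-then-average idea, written here in Python/NumPy (np.interp plays the role of MATLAB's interp1; the example data and grid resolution are made up):

import numpy as np

# Each curve is a (times, values) pair sampled at its own times.
curves = [(np.array([0.0, 0.5, 1.2, 3.0, 5.0]), np.array([1.0, 1.4, 2.0, 2.5, 2.4])),
          (np.array([0.0, 0.8, 2.0, 4.1, 5.0]), np.array([0.8, 1.6, 2.2, 2.6, 2.3]))]

# Common grid covering only the span shared by all curves, so nothing is extrapolated.
t_grid = np.linspace(max(t[0] for t, _ in curves), min(t[-1] for t, _ in curves), 100)
resampled = np.array([np.interp(t_grid, t, y) for t, y in curves])

mean_curve = resampled.mean(axis=0)
std_curve = resampled.std(axis=0)     # for the standard-deviation plot mentioned in the question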
An alternative is to do a weighted mean. This way you avoid truncating the range of your measured values. Assuming x is a vector of measured values and t is a vector of measurement times in seconds from some reference time, you can compute the weighted mean by:
timeStep = diff(t);                                           % duration associated with each sample
weightedMean = sum(timeStep .* x(1:end-1)) / sum(timeStep);   % duration-weighted average of x
As mentioned in the comments above, a sample of your data would help a lot in suggesting the appropriate method for calculating the "average".

Is there a standard metric for sorted text?

Given a range of numbers, say from [80,240], it is easy to determine how much of that range lies within [100,105]: (105-100)/(240-80) = 5/160 = .03125. Easy.
So now, how much of a Meriam Webster dictionary lies between umbrella and velvet? Even if we assume uniform distribution of text across the corpus, is there a standard metric for text?
I don't think there is a standard for that. If you had all entries from Meriam Webster in an array, you could use first and last positions as the bounds, so you would have a set going from 1 to n. Then you could pick the positions of "umbrella" and "velvet", call them x and y, and calculate your range as (y - x + 1) / (n).
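A minimal sketch of the position-based calculation described above, assuming the dictionary entries are available as a sorted Python list (the tiny word list here is a stand-in for the full dictionary):

import bisect

words = sorted(["apple", "banana", "umbrella", "uncle", "vase", "velvet", "zebra"])

def coverage(words, start, end):
    # Fraction of the sorted entries lying between start and end, inclusive.
    x = bisect.bisect_left(words, start)
    y = bisect.bisect_right(words, end)
    return (y - x) / len(words)

print(coverage(words, "umbrella", "velvet"))   # 4 of the 7 toy entries fall in the range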
That works if you are seeing words as elements of an ordered set, so as to have them behave as real numbers. You are basically dividing the distance between two numbers in a set by the distance between the boundaries of the set. Some forms of algebra deal with them differently: when calculating the Levenshtein distance between any two given words, for example, each word is seen as a vector with as many dimensions as it has characters.
You could define the boundaries of your n-dimensional space by using the longest word in Meriam Webster (hint: it's "pneumonoultramicroscopicsilicovolcanoconiosis", so your space would have 45 dimensions). However, when considering any A-B pair of words, a third word C of intermediate length may or may not lie between them, depending on the operations involved in the transformation from A to B.
You'd have to check every word with a length between that of A and B to see whether it is part of the range between A and B... So it's not a matter of simple calculation, and I don't know whether this would even be feasible on a regular computer nowadays. And that's just considering Meriam Webster's close to half a million entries.
