Types with Math.NET Numerics

I am starting to use the Math.NET Numerics library and I can't find examples, so I'm running into a few issues:
To make a simple example, I have two arrays of doubles. I want to divide one by the other and then calculate the moving average.
So, the code looks like this:
var VD1 = Vector<double>.Build.Dense(Data1.ToArray());
var VD2 = Vector<double>.Build.Dense(Data2.ToArray());
var R = VD1 / VD2;
var SMA = R.MovingAverage(15);
The problem is that the data type changes along the way. It starts as two Vectors, the division result is a Vector, but the SMA result is not: it's an IEnumerable<double>.
So now, if I want to plug that result into more functions, for example multiply it by another array, I can't; I have to rebuild a Vector from the result.
Am I somehow doing this wrong? I can't imagine that the API would bounce back and forth between different but similar types.

You are doing it right. That is how Math.NET is designed. E.g., var R = VD1 / VD2; calls
// Summary: Pointwise divides two Vectors.
public static Vector<T> operator /(Vector<T> dividend, Vector<T> divisor);
and returns Vector<T>.
var SMA = R.MovingAverage(15); calls
public static IEnumerable<double> MovingAverage(this IEnumerable<double> samples, int windowSize);
and returns IEnumerable<double>.
You can call MovingAverage on your Vector<double> R because Vector<double> implements IEnumerable<double>, so the call compiles without an explicit cast. But MovingAverage does not know its argument is a Vector<double>; it is designed to return IEnumerable<double>.
And that makes sense. As far as I remember from college, a moving average is about time series, and it has no explicit relationship to vectors.
But there are workarounds. For example, you can write your own overload for MovingAverage:
using System.Linq;
using MathNet.Numerics.LinearAlgebra;
using MathNet.Numerics.LinearAlgebra.Double;
using MathNet.Numerics.Statistics;

static class VectorHelper
{
    public static Vector<double> MovingAverage(this Vector<double> samples, int windowSize)
    {
        return DenseVector.OfEnumerable(samples.AsEnumerable().MovingAverage(windowSize));
    }
}
Then var SMA = R.MovingAverage(15); is a Vector<double>.
Anyway, building a new Vector instance from the result is the right and logical way to go.

Related

Finding closest triangle from a point using octree

I have a list of triangles in 3D space and a point described with (x,y,z) coordinates. I am writing a method for returning the closest triangle to that point.
The naive implementation I wrote initially was to loop through all the triangles, check the distance from that point and then return the one with the minimum distance. In most cases though the list of triangles I am working with consists of thousands or tens of thousands of elements, so I am looking at ways of optimising it.
I have been trying to make it work using an octree structure, so I have created an octree that stores all the triangles. I thought that a possible approach would be to find the octree cell closest to that point by calculating the distance between the point and the center of each cell, and then compare only against the triangles inside that cell.
I am not sure though of how to retrieve the closest cell from the octree (it's the first time I'm using an octree). This is the method I have written so far:
public Octree getClosestCell(final Vec3D point) {
    if (children != null) {
        // Find the child whose centroid is closest to the point...
        float minDist = Float.MAX_VALUE;
        Octree closestCell = null;
        for (int i = 0; i < 8; i++) {
            final float dist = point.distanceTo(children[i].getCentroid());
            if (dist < minDist) {
                minDist = dist;
                closestCell = children[i];
            }
        }
        // ...and recurse into it.
        return closestCell.getClosestCell(point);
    } else {
        // Leaf cell: this is the closest cell.
        return this;
    }
}
So to sum up, I have 2 questions:
Does the suggested approach sound like a good solution for optimising this problem?
Does the method above seem correct, or is there a better way of retrieving the closest cell?

JTS : distance between two geometries bypassing another one in the middle

Let's say that I want to calculate the distance between two geometries with JTS, but there is another one in the middle that I can't go across (as if it were a wall). It could look like this:
I wonder how I could calculate that.
In this case, geom1 and geom2 are 38.45 meters apart when I calculate the straight-line distance. But if I can't go across that line, I have to go around it on the northern side, and the distance would probably be more than 70 meters.
The obstacle in the middle could be a line, a polygon, or anything else.
I wonder if there is any built-in function in JTS, or something else I could use. If there is nothing out there, I guess I should look for some other workaround, as trying to solve complex routing problems is beyond my knowledge.
This is the straightforward piece of code that uses JTS for the distance; it does not yet take the geometry in the middle into account.
import org.apache.log4j.Logger;
import com.vividsolutions.jts.geom.Geometry;
import com.vividsolutions.jts.io.ParseException;
import com.vividsolutions.jts.io.WKTReader;

public class distanceTest {

    private final static Logger logger = Logger.getLogger("distanceTest");

    public static void main(String[] args) {
        // Projection : EPSG:32631
        // We build one of the geometries on one side
        String sGeom1 = "POLYGON ((299621.3240601513 5721036.003245114, 299600.94820609683 5721085.042327096, 299587.7719688322 5721052.9152064435, 299621.3240601513 5721036.003245114))";
        Geometry geom1 = distanceTest.buildGeometry(sGeom1);

        // We build the geometry on the other side
        String sGeom2 = "POLYGON ((299668.20990794065 5721092.766132105, 299647.3623194871 5721073.557249224, 299682.8494029705 5721049.148841454, 299668.20990794065 5721092.766132105))";
        Geometry geom2 = distanceTest.buildGeometry(sGeom2);

        // There is a geometry in the middle, as if it was a wall
        String split = "LINESTRING (299633.6804935104 5721103.780167559, 299668.99872434285 5720999.981241705, 299608.8457218057 5721096.601805294)";
        Geometry splitGeom = distanceTest.buildGeometry(split);

        // We calculate the distance not taking care of the wall in the middle
        double distance = geom1.distance(geom2);
        logger.error("Distance : " + distance);
    }

    public static Geometry buildGeometry(final String areaWKT) {
        final WKTReader fromText = new WKTReader();
        Geometry area;
        try {
            area = fromText.read(areaWKT);
        } catch (final ParseException e) {
            area = null;
        }
        return area;
    }
}
This works with SQL Server's spatial types; I hope you have the same or similar methods at your disposal in JTS.
In theory, in this instance you could create a ConvexHull containing the two geometries AND your "impassable" geometry.
Geometry convexHull = geom1.STUnion(geom2).STUnion(splitGeom).STConvexHull();
Next, extract the border of the ConvexHull to a linestring (use STGeometry(1) - I think).
Geometry convexHullBorder = convexHull.STGeometry(1);
EDIT: Actually, with Geometry you can use STExteriorRing().
Geometry convexHullBorder = convexHull.STExteriorRing();
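For reference, a rough JTS equivalent of those three calls (a sketch only; it assumes the convex hull of the union comes back as a Polygon, which holds for non-degenerate input) could look like this:

import com.vividsolutions.jts.geom.Geometry;
import com.vividsolutions.jts.geom.LineString;
import com.vividsolutions.jts.geom.Polygon;

public class ConvexHullBorder {
    // Union of both geometries plus the "wall", then the hull's exterior ring.
    public static LineString borderOf(Geometry geom1, Geometry geom2, Geometry wall) {
        Geometry convexHull = geom1.union(geom2).union(wall).convexHull();
        return ((Polygon) convexHull).getExteriorRing();
    }
}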
Lastly, pick one of your geometries and, for each point it shares with the border of the ConvexHull, walk the border from that point until you reach the first point that is shared with the other geometry, adding the distance between the previous and current point at each step. If the second point you hit belongs to the same geometry you started from, exit the loop and move on to the next starting point to save time. Repeat for the second geometry.
When you've done this for all possibilities, you can simply take the minimum value (there will be only two - Geom1 to Geom2 and Geom2 to Geom1) and there is your answer.
Of course, there are plenty of scenarios in which this is too simple, but if all scenarios simply have one "wall" in them, it will work.
Some ideas of where it will not work:
The "wall" is a polygon, fully enveloping both geometries - but then how would you ever get there anyway?
There are multiple "walls" which do not intersect each other (gaps between them) - this method will ignore those passes in between "walls". If however multiple "walls" intersect, creating essentially one larger "wall" the theory will still work.
Hope that makes sense?
EDIT: Actually, upon further reflection, there are other scenarios where the ConvexHull approach will not work; for instance, the shape of your polygons could cause the ConvexHull to not produce the shortest path between the geometries and your "walls". So this will not get you 100% accuracy.

Statistical String Comparison

I am looking for a method to compare string similarity. Specifically, given two addresses I would like a measure of their similarity.
E.G.
Given 8219 Lime Forest Blvd
and 8219 Lime Forst Boulevard
The output of the comparison should give me an idea of how similar the strings are.
Levenshtein distance is the way to go. Just an out-of-the-box idea: two addresses can differ a lot in form (one can be a postal code, another a street with a number), and a lot of money has been spent on building good geocoding services (like https://developers.google.com/maps/documentation/geocoding/?hl=cs). So an alternative approach would be to calculate the longitude/latitude of both addresses via a geocoding service and see whether they match :)
You could use something like this:
import org.apache.commons.lang.StringUtils;

public class StringComparison {

    /**
     * @param args
     */
    public static void main(String[] args) {
        String s1 = "8219 Lime Forest Blvd";
        String s2 = "8219 Lime Forst Boulevard";

        // number of single-character edits needed to turn s1 into s2
        int distance = StringUtils.getLevenshteinDistance(s1, s2);

        // "relative" difference
        float d = (float) distance / (float) s1.length();
        System.out.println(d);
    }
}
getLevenshteinDistance will give you the number of single-character edits (insertions, deletions, substitutions) needed to turn s1 into s2.
I think it's more useful if you divide this number by the string length (careful with division by zero) and manually find a sweet spot where the difference is small enough to treat the two strings as the same address (for me, this is usually around 20-30%).
This example is in Java; the library used is Apache Commons Lang: http://commons.apache.org/proper/commons-lang/index.html
Also, you could improve this by replacing known abbreviations and comparing the normalized strings as well, as in the sketch below.
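As a sketch of that idea (the abbreviation table and class name here are made up for illustration), you could normalize both strings before computing the distance:

import java.util.LinkedHashMap;
import java.util.Map;
import org.apache.commons.lang.StringUtils;

public class NormalizedComparison {

    // Hypothetical abbreviation table; extend it to match your data.
    private static final Map<String, String> ABBREVIATIONS = new LinkedHashMap<String, String>();
    static {
        ABBREVIATIONS.put("blvd", "boulevard");
        ABBREVIATIONS.put("st", "street");
        ABBREVIATIONS.put("ave", "avenue");
    }

    // Lower-cases the address and expands known abbreviations token by token.
    static String normalize(String address) {
        StringBuilder sb = new StringBuilder();
        for (String token : address.toLowerCase().split("\\s+")) {
            String expanded = ABBREVIATIONS.get(token);
            sb.append(expanded != null ? expanded : token).append(' ');
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String s1 = normalize("8219 Lime Forest Blvd");
        String s2 = normalize("8219 Lime Forst Boulevard");
        int distance = StringUtils.getLevenshteinDistance(s1, s2); // only "forst" vs "forest" differs now
        System.out.println((float) distance / (float) s1.length());
    }
}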

Chebyshev LPF introducing noise

I have created a simple Chebyshev low pass filter based on coefficients generated by this site: http://www-users.cs.york.ac.uk/~fisher/mkfilter/, which I am using to filter out frequencies above 4kHz in a 16kHz sample rate audio signal before downsampling to 8kHz. Here's my code (which is C#, but this question is not C#-specific; feel free to answer in other languages).
/// <summary>
/// Chebyshev, lowpass, -0.5dB ripple, order 4, 16kHz sample rate, 4kHz cutoff
/// </summary>
class ChebyshevLpf4Pole
{
    const int NZEROS = 4;
    const int NPOLES = 4;
    const float GAIN = 1.403178626e+01f;

    private float[] xv = new float[NZEROS + 1];
    private float[] yv = new float[NPOLES + 1];

    public float Filter(float inValue)
    {
        xv[0] = xv[1]; xv[1] = xv[2]; xv[2] = xv[3]; xv[3] = xv[4];
        xv[4] = inValue / GAIN;
        yv[0] = yv[1]; yv[1] = yv[2]; yv[2] = yv[3]; yv[3] = yv[4];
        yv[4] = (xv[0] + xv[4]) + 4 * (xv[1] + xv[3]) + 6 * xv[2]
              + (-0.1641503452f * yv[0]) + (0.4023376691f * yv[1])
              + (-0.9100943707f * yv[2]) + (0.5316388226f * yv[3]);
        return yv[4];
    }
}
To test it I created a sine wave "chirp" from 20Hz to 8kHz using Audacity. The test signal looks like this:
After filtering it I get:
The waveform shows that the filter is indeed reducing the amplitude of frequencies above 4kHz, but a load of noise is added to my signal. This seems to be the case whichever filter type I try to implement (e.g. Butterworth, raised cosine, etc.).
Am I doing something wrong, or do these filters simply introduce artefacts at other frequencies? If I downsample using the naive approach of averaging every pair of samples, I don't get this noise at all (but obviously the aliasing is much worse).
OK, it was me being really stupid. The creation of my LPF was happening inside a processing loop instead of outside it, meaning that every 512 samples I was creating a new filter and losing the saved state. With just one instance of the filter running over the whole file, the noise goes away, and, as expected, I get aliased frequencies because the filter cannot completely remove everything above the cutoff.
I checked your filter-code in Mathematica and it works fine here without introducing noise, so probably the noise comes from some other part of your code.
It's possible that you have numerical stability problems, particularly if any of the poles are close to the unit circle. Try making all your intermediate terms double precision and then cast back to single precision at the end. I'm not too familiar with C# but in C this would be:
yv[4] = (float)(((double)xv[0] + (double)xv[4]) + 4.0 * ((double)xv[1] + (double)xv[3]) + 6.0 * xv[2]
+ (-0.1641503452 * (double)yv[0]) + (0.4023376691 * (double)yv[1])
+ (-0.9100943707 * (double)yv[2]) + (0.5316388226 * (double)yv[3]));
You haven't properly initialized your xv and yv arrays before using them for the first time. In most languages this means their values are undefined which may lead to unexpected results like yours. Initializing them to a proper value (like 0) may solve your issue.

Is it possible to do an algebraic curve fit with just a single pass of the sample data?

I would like to do an algebraic curve fit of 2D data points, but for various reasons - it isn't really possible to have much of the sample data in memory at once, and iterating through all of it is an expensive process.
(The reason for this is that actually I need to fit thousands of curves simultaneously based on gigabytes of data which I'm reading off disk, and which is therefore sloooooow).
Note that the number of polynomial coefficients will be limited (perhaps 5-10), so an exact fit will be extremely unlikely, but this is ok as I'm trying to find an underlying pattern in data with a lot of random noise.
I understand how one can use a genetic algorithm to fit a curve to a dataset, but this requires many passes through the sample data, and thus isn't practical for my application.
Is there a way to fit a curve with a single pass of the data, where the state that must be maintained from sample to sample is minimal?
I should add that the nature of the data is that the points may lie anywhere on the X axis between 0.0 and 1.0, but the Y values will always be either 1.0 or 0.0.
So, in Java, I'm looking for a class with the following interface:
public interface CurveFit {
    public void addData(double x, double y);
    public List<Double> getBestFit(); // Returns the polynomial coefficients
}
The class that implements this must not need to keep much data in its instance fields, no more than a kilobyte even for millions of data points. This means that you can't just store the data as you get it to do multiple passes through it later.
edit: Some have suggested that finding an optimal curve in a single pass may be impossible, however an optimal fit is not required, just as close as we can get it in a single pass.
The bare bones of an approach might be to start with a curve and then have a way to modify it so it gets slightly closer to each new data point as it comes in - effectively a form of gradient descent (a sketch of this idea follows). It is hoped that with sufficient data (and the data will be plentiful), we end up with a pretty good curve. Perhaps this inspires someone to a solution.
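A minimal sketch of that idea (my own illustration, not an existing library; the learning rate and degree would need tuning in practice, possibly with a decaying step size):

public class StreamingPolyFit {

    private final double[] coeffs;     // polynomial coefficients: constant-size state
    private final double learningRate;

    public StreamingPolyFit(int degree, double learningRate) {
        this.coeffs = new double[degree + 1];
        this.learningRate = learningRate;
    }

    // Nudge the curve slightly toward each incoming point (stochastic gradient
    // descent on the squared error; the model is linear in its coefficients).
    public void addData(double x, double y) {
        double[] pow = new double[coeffs.length];
        double prediction = 0.0;
        double xi = 1.0;
        for (int i = 0; i < coeffs.length; i++) {
            pow[i] = xi;
            prediction += coeffs[i] * xi;
            xi *= x;
        }
        double error = y - prediction;
        for (int i = 0; i < coeffs.length; i++) {
            coeffs[i] += learningRate * error * pow[i];
        }
    }

    public double[] getCoefficients() {
        return coeffs.clone();
    }
}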
Yes, it is a projection. For
y = X beta + error
where lowercased terms are vectors, and X is a matrix, you have the solution vector
\hat{beta} = inverse(X'X) X' y
as per the OLS page. You almost never want to compute this directly but rather use LR, QR or SVD decompositions. References are plentiful in the statistics literature.
If your problem has only one parameter (and x is hence a vector as well) then this reduces to just summation of cross-products between y and x.
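As a tiny sketch of that single-parameter case (my own illustration; only two running sums are kept, no samples are stored, and the model here has no intercept term):

public class OnePassSlope {
    public static void main(String[] args) {
        // hypothetical samples: {x, y} pairs
        double[][] samples = { {0.1, 0.0}, {0.5, 1.0}, {0.9, 1.0} };
        double sumXY = 0.0, sumXX = 0.0;
        for (double[] p : samples) {    // single pass, constant state
            sumXY += p[0] * p[1];
            sumXX += p[0] * p[0];
        }
        // beta = sum(x*y) / sum(x*x) for the model y = beta * x (approximately)
        System.out.println("beta = " + sumXY / sumXX);
    }
}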
If you don't mind that you'll get a straight line "curve", then you only need six variables for any amount of data. Here's the source code that's going into my upcoming book; I'm sure that you can figure out how the DataPoint class works:
Interpolation.h:
#ifndef __INTERPOLATION_H
#define __INTERPOLATION_H

#include "DataPoint.h"

class Interpolation
{
private:
    int m_count;
    double m_sumX;
    double m_sumXX; /* sum of X*X */
    double m_sumXY; /* sum of X*Y */
    double m_sumY;
    double m_sumYY; /* sum of Y*Y */

public:
    Interpolation();
    void addData(const DataPoint& dp);

    double slope() const;
    double intercept() const;
    double interpolate(double x) const;
    double correlate() const;
};

#endif // __INTERPOLATION_H
Interpolation.cpp:
#include <cmath>
#include "Interpolation.h"

Interpolation::Interpolation()
{
    m_count = 0;
    m_sumX = 0.0;
    m_sumXX = 0.0;
    m_sumXY = 0.0;
    m_sumY = 0.0;
    m_sumYY = 0.0;
}

void Interpolation::addData(const DataPoint& dp)
{
    m_count++;
    m_sumX += dp.getX();
    m_sumXX += dp.getX() * dp.getX();
    m_sumXY += dp.getX() * dp.getY();
    m_sumY += dp.getY();
    m_sumYY += dp.getY() * dp.getY();
}

double Interpolation::slope() const
{
    return (m_sumXY - (m_sumX * m_sumY / m_count)) /
           (m_sumXX - (m_sumX * m_sumX / m_count));
}

double Interpolation::intercept() const
{
    return (m_sumY / m_count) - slope() * (m_sumX / m_count);
}

double Interpolation::interpolate(double X) const
{
    return intercept() + slope() * X;
}

double Interpolation::correlate() const
{
    return m_sumXY / sqrt(m_sumXX * m_sumYY);
}
Why not use a ring buffer of some fixed size (say, the last 1000 points) and do a standard QR decomposition-based least squares fit to the buffered data? Once the buffer fills, each time you get a new point you replace the oldest and re-fit. That way you have a bounded working set that still has some data locality, without all the challenges of live stream (memoryless) processing.
Are you limiting the number of polynomial coefficients (i.e. fitting to a max power of x in your polynomial)?
If not, then you don't need a "best fit" algorithm - you can always fit N data points EXACTLY to a polynomial of N coefficients.
Just use matrices to solve N simultaneous equations for N unknowns (the N coefficients of the polynomial).
If you are limiting to a max number of coefficients, what is your max?
Following your comments and edit:
What you want is a low-pass filter to filter out noise, not fit a polynomial to the noise.
Given the nature of your data:
the points may lie anywhere on the X axis between 0.0 and 1.0, but the Y values will always be either 1.0 or 0.0.
Then you don't need even a single pass, as these two lines will pass exactly through every point:
X = [0.0 ... 1.0], Y = 0.0
X = [0.0 ... 1.0], Y = 1.0
Two short line segments, unit length, and every point falls on one line or the other.
Admittedly, an algorithm to find a good curve fit for arbitrary points in a single pass is interesting, but (based on your question), that's not what you need.
Assuming that you don't know which point should belong to which curve, something like a Hough Transform might provide what you need.
The Hough Transform is a technique that allows you to identify structure within a data set. One use is for computer vision, where it allows easy identification of lines and borders within the field of sight.
Advantages for this situation:
Each point need be considered only once
You don't need to keep a data structure for each candidate line, just one (complex, multi-dimensional) structure
Processing of each line is simple
You can stop at any point and output a set of good matches
You never discard any data, so it's not reliant on any accidental locality of references
You can trade off between accuracy and memory requirements
It isn't limited to exact matches, but will highlight partial matches too.
An approach
To find cubic fits, you'd construct a 4-dimensional Hough space, into which you'd project each of your data-points. Hotspots within Hough space would give you the parameters for the cubic through those points.
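A rough sketch of that approach (my own illustration; the bin count, coefficient range, and class name are made-up values, and the accumulator array, not the data, is what dominates memory, which is where the accuracy/memory trade-off mentioned above comes in):

public class CubicHough {

    // Votes for y = c0 + c1*x + c2*x^2 + c3*x^3, with each coefficient
    // quantized into BINS cells over an assumed range [MIN, MAX].
    static final int BINS = 16;
    static final double MIN = -2.0, MAX = 2.0;
    private final int[] votes = new int[BINS * BINS * BINS * BINS];

    private static double value(int bin) {  // bin index -> coefficient value (bin center)
        return MIN + (MAX - MIN) * (bin + 0.5) / BINS;
    }

    private static int bin(double v) {      // coefficient value -> bin index
        return (int) Math.floor((v - MIN) / (MAX - MIN) * BINS);
    }

    // Each data point is considered once: for every (c1, c2, c3) cell the
    // matching c0 is implied by the point, so that cell receives one vote.
    public void addData(double x, double y) {
        for (int b1 = 0; b1 < BINS; b1++) {
            double c1 = value(b1);
            for (int b2 = 0; b2 < BINS; b2++) {
                double c2 = value(b2);
                for (int b3 = 0; b3 < BINS; b3++) {
                    double c3 = value(b3);
                    int b0 = bin(y - c1 * x - c2 * x * x - c3 * x * x * x);
                    if (b0 >= 0 && b0 < BINS) {
                        votes[((b0 * BINS + b1) * BINS + b2) * BINS + b3]++;
                    }
                }
            }
        }
    }

    // The hotspot (highest-voted cell) gives the cubic's coefficients.
    public double[] bestFit() {
        int best = 0;
        for (int i = 1; i < votes.length; i++) {
            if (votes[i] > votes[best]) best = i;
        }
        int b3 = best % BINS;
        int b2 = (best / BINS) % BINS;
        int b1 = (best / (BINS * BINS)) % BINS;
        int b0 = best / (BINS * BINS * BINS);
        return new double[] { value(b0), value(b1), value(b2), value(b3) };
    }
}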
You need the solution to an overdetermined linear system. The popular methods are normal equations (not usually recommended), QR factorization, and singular value decomposition (SVD). Wikipedia has decent explanations, and Trefethen and Bau is very good. Your options:
1. Out-of-core implementation via the normal equations. This requires the product A'A, where A has many more rows than columns (so the result is very small). The matrix A is completely defined by the sample locations, so you don't have to store it; thus computing A'A is reasonably cheap (very cheap if you don't need to hit memory for the node locations). Once A'A is computed, you get the solution in one pass through your input data, but the method can be unstable. (A sketch of this option follows the list.)
2. Implement an out-of-core QR factorization. Classical Gram-Schmidt will be fastest, but you have to be careful about stability.
3. Do it in-core with distributed memory (if you have the hardware available). Libraries like PLAPACK and ScaLAPACK can do this; the performance should be much better than option 1. The parallel scalability is not fantastic, but will be fine if it's a problem size that you would even think about doing in serial.
4. Use iterative methods to compute an SVD. Depending on the spectral properties of your system (maybe after preconditioning), this could converge very fast, and it does not require storage for the matrix (which in your case has 5-10 columns, each the length of your input data). A good library for this is SLEPc; you only have to form the product of the Vandermonde matrix with a vector (so you only need to store the sample locations). This is very scalable in parallel.
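To make option 1 concrete, here is a sketch of accumulating A'A and A'y in a single pass for a polynomial basis (my own illustration, roughly matching the CurveFit interface from the question; the plain Gauss-Jordan solve is an assumption, and the usual warning applies that the normal equations become ill-conditioned for higher degrees):

import java.util.ArrayList;
import java.util.List;

public class OnePassPolyFit {

    private final int n;           // number of coefficients = degree + 1
    private final double[][] ata;  // accumulated A'A (moment matrix)
    private final double[] aty;    // accumulated A'y

    public OnePassPolyFit(int degree) {
        n = degree + 1;
        ata = new double[n][n];
        aty = new double[n];
    }

    // Accumulate one sample; the point itself is never stored.
    public void addData(double x, double y) {
        double[] pow = new double[n];
        pow[0] = 1.0;
        for (int i = 1; i < n; i++) {
            pow[i] = pow[i - 1] * x;
        }
        for (int i = 0; i < n; i++) {
            aty[i] += pow[i] * y;
            for (int j = 0; j < n; j++) {
                ata[i][j] += pow[i] * pow[j];
            }
        }
    }

    // Solve the small n-by-n system (A'A) c = A'y with Gauss-Jordan
    // elimination and partial pivoting; fails if the system is singular.
    public List<Double> getBestFit() {
        double[][] m = new double[n][n + 1];
        for (int i = 0; i < n; i++) {
            System.arraycopy(ata[i], 0, m[i], 0, n);
            m[i][n] = aty[i];
        }
        for (int col = 0; col < n; col++) {
            int pivot = col;
            for (int r = col + 1; r < n; r++) {
                if (Math.abs(m[r][col]) > Math.abs(m[pivot][col])) pivot = r;
            }
            double[] tmp = m[col]; m[col] = m[pivot]; m[pivot] = tmp;
            for (int r = 0; r < n; r++) {
                if (r == col) continue;
                double f = m[r][col] / m[col][col];
                for (int c = col; c <= n; c++) {
                    m[r][c] -= f * m[col][c];
                }
            }
        }
        List<Double> coeffs = new ArrayList<Double>();
        for (int i = 0; i < n; i++) {
            coeffs.add(m[i][n] / m[i][i]);
        }
        return coeffs;
    }
}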
I believe I found the answer to my own question based on a modified version of this code. For those interested, my Java code is here.
