Optimizing parameter input for functions - visual-c++

I have a function that has a large number if individual input parameters, the function is also run hundreds of millions of times during the program.
If i wanted to optimize this function, should I create a new data structure that holds all the input parameters and pass it to the function by reference instead of passing each parameter individually to the function? Or would it not matter because the compiler is smart enough to deal wit this in an even more efficient manner?

In general, it's much better to pass a data structure that contains your variables. This isn't pretty to look at or use:
void f(int a, int b, int c, int d, int e, int f)
{
// do stuff
}
This is much nicer:
void f(Params p)
{
// do stuff with p
}
You may want to do pass by reference, so the compiler can just pass a reference to the object, and not copy the entire data structure.
As a real example:
double distance(double x1, double y1, double z1, double x2, double y2, double z2)
{
double dx = x1 - x2;
double dy = y1 - y2;
double dz = z1 - z2;
return sqrt(dx*dx + dy*dy + dz*dz);
}
It would be better if encapsulated our (x, y, z) into a data structure though:
struct Point
{
double x;
double y;
double z;
};
double distance(const Point &p1, const Point &p2)
{
double dx = p1.x - p2.x;
double dy = p1.y - p2.y;
double dz = p1.z - p2.z;
return sqrt(dx*dx + dy*dy + dz*dz);
}
Much cleaner code, and you get the added bonus that it could* perform better (*Depending on how smart your compiler is at optimizing either version).
Obviously this could vary wildly depending on what you're actually trying to accomplish, but if you have several (4+) variables that have a similar usage in a certain context, it may be better to just pass it in a data structure.

Are the arguments mostly constant, or do most of them change on every call? You don't want to evaluate arguments many times if you could only do them once.
Keep in mind what the compiler does with arguments.
It evaluates each one and pushes it on the stack. Then the function is entered, and it refers to those arguments by their offset in the stack. So it is basically the same as if you put the arguments into a block and passed the block. However, if you build the block yourself, you may be able to re-use old values and only evaluate the ones you know have changed.
In any case, you really have to look at how much work goes on inside the function relative to the time spent passing arguments to it. It's irrelevant that you're calling it 10^8 times without knowing over what overall time span. That could be 10ns per call, or 10ms per call. If the latter, nearly all the time is spent inside the function, so it probably doesn't make much difference how you call it.

Related

Robust linear interpolation

Given two segment endpoints A and B (in two dimensions), I would like to perform linear interpolation based on a value t, i.e.:
C = A + t(B-A)
In the ideal world, A, B and C should be collinear. However, we are operating with limited floating-point here, so there will be small deviations. To work around numerical issues with other operations I am using robust adaptive routines originally created by Jonathan Shewchuk. In particular, Shewchuk implements an orientation function orient2d that uses adaptive precision to exactly test the orientation of three points.
Here my question: is there a known procedure how the interpolation can be computed using the floating-point math, so that it lies exactly on the line between A and B? Here, I care less about the accuracy of the interpolation itself and more about the resulting collinearity. In another terms, its ok if C is shifted around a bit as long as collinearity is satisfied.
The bad news
The request can't be satisfied. There are values of A and B for which there is NO value of t other than 0 and 1 for which lerp(A, B, t) is a float.
A trivial example in single precision is x1 = 12345678.f and x2 = 12345679.f. Regardless of the values of y1 and y2, the required result must have an x component between 12345678.f and 12345679.f, and there's no single-precision float between these two.
The (sorta) good news
The exact interpolated value, however, can be represented as the sum of 5 floating-point values (vectors in the case of 2D): one for the formula's result, one for the error in each operation [1] and one for multiplying the error by t. I'm not sure if that will be useful to you. Here's a 1D C version of the algorithm in single precision that uses fused multiply-add to calculate the product error, for simplicity:
#include <math.h>
float exact_sum(float a, float b, float *err)
{
float sum = a + b;
float z = sum - a;
*err = a - (sum - z) + (b - z);
return sum;
}
float exact_mul(float a, float b, float *err)
{
float prod = a * b;
*err = fmaf(a, b, -prod);
return prod;
}
float exact_lerp(float A, float B, float t,
float *err1, float *err2, float *err3, float *err4)
{
float diff = exact_sum(B, -A, err1);
float prod = exact_mul(diff, t, err2);
*err1 = exact_mul(*err1, t, err4);
return exact_sum(A, prod, err3);
}
In order for this algorithm to work, operations need to conform to IEEE-754 semantics in round-to-nearest mode. That's not guaranteed by the C standard, but the GNU gcc compiler can be instructed to do so, at least in processors supporting SSE2 [2][3].
It is guaranteed that the arithmetic addition of (result + err1 + err2 + err3 + err4) will be equal to the desired result; however, there is no guarantee that the floating-point addition of these quantities will be exact.
To use the above example, exact_lerp(12345678.f, 12345679.f, 0.300000011920928955078125f, &err1, &err2, &err3, &err4) returns a result of 12345678.f and err1, err2, err3 and err4 are 0.0f, 0.0f, 0.300000011920928955078125f and 0.0f respectively. Indeed, the correct result is 12345678.300000011920928955078125 which can't be represented as a single-precision float.
A more convoluted example: exact_lerp(0.23456789553165435791015625f, 7.345678806304931640625f, 0.300000011920928955078125f, &err1, &err2, &err3, &err4) returns 2.3679010868072509765625f and the errors are 6.7055225372314453125e-08f, 8.4771045294473879039287567138671875e-08f, 1.490116119384765625e-08f and 2.66453525910037569701671600341796875e-15f. These numbers add up to the exact result, which is 2.36790125353468550173374751466326415538787841796875 and can't be exactly stored in a single-precision float.
All numbers in the examples above are written using their exact values, rather than a number that approximates to them. For example, 0.3 can't be represented exactly as a single-precision float; the closest one has an exact value of 0.300000011920928955078125 which is the one I've used.
It might be possible that if you calculate err1 + err2 + err3 + err4 + result (in that order), you get an approximation that is considered collinear in your use case. Perhaps worth a try.
References
[1] Graillat, Stef (2007). Accurate Floating Point Product and Exponentiation.
[2] Enabling strict floating point mode in GCC
[3] Semantics of Floating Point Math in GCC

Define swizzling programmatically (as in GLSL)

How would one write swizzling as a defined behaviour in a programming language? (swizzling members like matrices and vectors in GLSL) So if I wanted to make a programming language that would allow the definition of swizzling on some members, what would be a good way to do it? So for example I could do this:
struct
{
swizzable
{
float x, float y, float z, float w
}
}
But this is missing a lot. For example it does not define that what sould it return when I swizzle more or less elements or assign to a subset or just the elements backwards. Like in GLSL I can do v.xyz to create a Vec3 from a Vec4 called v. Or I could assign a subset of members: v.zyx = ... in any order.
So this swizzable substruct is not a solution (or at least too limited). Another way would be to return an array of swizzled members and an implicit cast (with a constructor) would generate the wanted element:
struct Vec2
{
swizzable { float x, float y }
Vec2(float[2] elements)
{ x = elements[0]; y = elements[1]; }
}
struct Vec3
{
swizzable { float x, float y, float z }
}
So if I accessed a Vec3's x and y via swizzling, I would get a float[2] and because I have a constructor for Vec2, I can assign this array to it (and implicitly instantiating a vec2).
This looks like a better solution but still: How could one do better?
Edit: Sorry I didn't specify the question: I want to implement a programming language that supports this kind of thing.
I'm not sure how to give a good, detailed answer, so here is just one idea.
If I understand right, swizzling is mainly a syntactic convenience. The page https://www.opengl.org/wiki/GLSL_Optimizations gives the following example of swizzling:
gl_FragColor = mycolor.xyzw * constantList.xxxy + constantList.yyyx;
This could simply be syntactic shorthand for something like:
gl_FragColor = Vector(mycolor.x, mycolor.y, mycolor.z, mycolor.w)
* Vector(constantList.x, constantList.x, constantList.x, constantList.y)
+ Vector(constantList.y, constantList.y, constantList.y, constantList.x);
So, one step may be to figure out how to parse the shorter syntax and interpret it as meaning something similar to the longer syntax.
I don't see why it would be necessary to declare the struct as anything more complicated than struct myStruct { float x, float y, float z, float w }. The language itself should be able to handle all the details of how to implement this swizzling.

Ray Trace Refraction looks "fake"

So in my ray tracer that I'm building I've gotten refraction to work for spheres along with caustic effects, however the glass spheres don't look particularly good. I believe the refracted math is correct because the light appears to be bending in inverting the way you'd expect it to however it doesn't look like glass, it just looks like paper or something.
I've read that total internal reflection is responsible for much of what makes glass look like it does, however when I tested to see if any of my refracted rays go above the critical angle, none of them do, so my glass sphere has no total internal reflection. I'm not sure if this is normal or if I've done something wrong. I've posted my refraction code below so if anyone has any suggestions I'd love to hear them.
/*
* Parameter 'dir' lets you know whether the ray is starting
* from outside the sphere going in (0) or from in the sphere
* going back out (1).
*/
void Ray::Refract(Intersection *hit, int dir)
{
float n1, n2;
if(dir == 0){ n1 = 1.0; n2 = hit->mat->GetRefract(); }
if(dir == 1){ n1 = hit->mat->GetRefract(); n2 = 1.0; }
STVector3 N = hit->normal/hit->normal.Length();
if(dir == 1) N = -N;
STVector3 V = D/D.Length();
double c1 = -STVector3::Dot(N, V);
double n = n1/n2;
double c2 = sqrt(1.0f - (n*n)*(1.0f - (c1*c1)));
STVector3 Rr = (n * V) + (n * c1 - c2) * N;
/*These are the parameters of the current ray being updated*/
E = hit->point; //Starting point
D = Rr; //Direction
}
This method is called during my main ray-tracing method RayTrace() which runs recursively. Here's a small section of it below that is in charge of refraction:
if (hit->mat->IsRefractive())
{
temp.Refract(hit, dir); //Temp is my ray that is being refracted
dir++;
result += RayTrace(temp, dir); //Result is the return RGB value.
}
You're right, you'll never get total internal reflection in a sphere (seen from the outside). That's because of symmetry: a ray inside a sphere will hit the surface at the same angle at both ends, meaning that, if it exceeds the critical angle at one end, it would have to exceed it at the other one as well (and so would not have been able to enter the sphere from the outside in the first place).
However, you'll still get a lot of partial reflection according to Fresnel's law. It doesn't look like your code accounts for that, which is probably why your glass looks fake. Try including that and see if it helps.
(Yes, it means that your rays will split in two whenever they hit a refractive surface. That happens in reality, so you'll just have to live with it. You can either trace both paths, or, if you're using a randomized ray tracing algorithm anyway, just sample one of them randomly with the appropriate weights.)
Ps. If you don't want to deal with stuff like light polarization, you may want to just use Schlick's approximation to the Fresnel equations.

Is it possible to do an algebraic curve fit with just a single pass of the sample data?

I would like to do an algebraic curve fit of 2D data points, but for various reasons - it isn't really possible to have much of the sample data in memory at once, and iterating through all of it is an expensive process.
(The reason for this is that actually I need to fit thousands of curves simultaneously based on gigabytes of data which I'm reading off disk, and which is therefore sloooooow).
Note that the number of polynomial coefficients will be limited (perhaps 5-10), so an exact fit will be extremely unlikely, but this is ok as I'm trying to find an underlying pattern in data with a lot of random noise.
I understand how one can use a genetic algorithm to fit a curve to a dataset, but this requires many passes through the sample data, and thus isn't practical for my application.
Is there a way to fit a curve with a single pass of the data, where the state that must be maintained from sample to sample is minimal?
I should add that the nature of the data is that the points may lie anywhere on the X axis between 0.0 and 1.0, but the Y values will always be either 1.0 or 0.0.
So, in Java, I'm looking for a class with the following interface:
public interface CurveFit {
public void addData(double x, double y);
public List<Double> getBestFit(); // Returns the polynomial coefficients
}
The class that implements this must not need to keep much data in its instance fields, no more than a kilobyte even for millions of data points. This means that you can't just store the data as you get it to do multiple passes through it later.
edit: Some have suggested that finding an optimal curve in a single pass may be impossible, however an optimal fit is not required, just as close as we can get it in a single pass.
The bare bones of an approach might be if we have a way to start with a curve, and then a way to modify it to get it slightly closer to new data points as they come in - effectively a form of gradient descent. It is hoped that with sufficient data (and the data will be plentiful), we get a pretty good curve. Perhaps this inspires someone to a solution.
Yes, it is a projection. For
y = X beta + error
where lowercased terms are vectors, and X is a matrix, you have the solution vector
\hat{beta} = inverse(X'X) X' y
as per the OLS page. You almost never want to compute this directly but rather use LR, QR or SVD decompositions. References are plentiful in the statistics literature.
If your problem has only one parameter (and x is hence a vector as well) then this reduces to just summation of cross-products between y and x.
If you don't mind that you'll get a straight line "curve", then you only need six variables for any amount of data. Here's the source code that's going into my upcoming book; I'm sure that you can figure out how the DataPoint class works:
Interpolation.h:
#ifndef __INTERPOLATION_H
#define __INTERPOLATION_H
#include "DataPoint.h"
class Interpolation
{
private:
int m_count;
double m_sumX;
double m_sumXX; /* sum of X*X */
double m_sumXY; /* sum of X*Y */
double m_sumY;
double m_sumYY; /* sum of Y*Y */
public:
Interpolation();
void addData(const DataPoint& dp);
double slope() const;
double intercept() const;
double interpolate(double x) const;
double correlate() const;
};
#endif // __INTERPOLATION_H
Interpolation.cpp:
#include <cmath>
#include "Interpolation.h"
Interpolation::Interpolation()
{
m_count = 0;
m_sumX = 0.0;
m_sumXX = 0.0;
m_sumXY = 0.0;
m_sumY = 0.0;
m_sumYY = 0.0;
}
void Interpolation::addData(const DataPoint& dp)
{
m_count++;
m_sumX += dp.getX();
m_sumXX += dp.getX() * dp.getX();
m_sumXY += dp.getX() * dp.getY();
m_sumY += dp.getY();
m_sumYY += dp.getY() * dp.getY();
}
double Interpolation::slope() const
{
return (m_sumXY - (m_sumX * m_sumY / m_count)) /
(m_sumXX - (m_sumX * m_sumX / m_count));
}
double Interpolation::intercept() const
{
return (m_sumY / m_count) - slope() * (m_sumX / m_count);
}
double Interpolation::interpolate(double X) const
{
return intercept() + slope() * X;
}
double Interpolation::correlate() const
{
return m_sumXY / sqrt(m_sumXX * m_sumYY);
}
Why not use a ring buffer of some fixed size (say, the last 1000 points) and do a standard QR decomposition-based least squares fit to the buffered data? Once the buffer fills, each time you get a new point you replace the oldest and re-fit. That way you have a bounded working set that still has some data locality, without all the challenges of live stream (memoryless) processing.
Are you limiting the number of polynomial coefficients (i.e. fitting to a max power of x in your polynomial)?
If not, then you don't need a "best fit" algorithm - you can always fit N data points EXACTLY to a polynomial of N coefficients.
Just use matrices to solve N simultaneous equations for N unknowns (the N coefficients of the polynomial).
If you are limiting to a max number of coefficients, what is your max?
Following your comments and edit:
What you want is a low-pass filter to filter out noise, not fit a polynomial to the noise.
Given the nature of your data:
the points may lie anywhere on the X axis between 0.0 and 1.0, but the Y values will always be either 1.0 or 0.0.
Then you don't need even a single pass, as these two lines will pass exactly through every point:
X = [0.0 ... 1.0], Y = 0.0
X = [0.0 ... 1.0], Y = 1.0
Two short line segments, unit length, and every point falls on one line or the other.
Admittedly, an algorithm to find a good curve fit for arbitrary points in a single pass is interesting, but (based on your question), that's not what you need.
Assuming that you don't know which point should belong to which curve, something like a Hough Transform might provide what you need.
The Hough Transform is a technique that allows you to identify structure within a data set. One use is for computer vision, where it allows easy identification of lines and borders within the field of sight.
Advantages for this situation:
Each point need be considered only once
You don't need to keep a data structure for each candidate line, just one (complex, multi-dimensional) structure
Processing of each line is simple
You can stop at any point and output a set of good matches
You never discard any data, so it's not reliant on any accidental locality of references
You can trade off between accuracy and memory requirements
Isn't limited to exact matches, but will highlight partial matches too.
An approach
To find cubic fits, you'd construct a 4-dimensional Hough space, into which you'd project each of your data-points. Hotspots within Hough space would give you the parameters for the cubic through those points.
You need the solution to an overdetermined linear system. The popular methods are Normal Equations (not usually recommended), QR factorization, and singular value decomposition (SVD). Wikipedia has decent explanations, Trefethen and Bau is very good. Your options:
Out-of-core implementation via the normal equations. This requires the product A'A where A has many more rows than columns (so the result is very small). The matrix A is completely defined by the sample locations so you don't have to store it, thus computing A'A is reasonably cheap (very cheap if you don't need to hit memory for the node locations). Once A'A is computed, you get the solution in one pass through your input data, but the method can be unstable.
Implement an out-of-core QR factorization. Classical Gram-Schmidt will be fastest, but you have to be careful about stability.
Do it in-core with distributed memory (if you have the hardware available). Libraries like PLAPACK and SCALAPACK can do this, the performance should be much better than 1. The parallel scalability is not fantastic, but will be fine if it's a problem size that you would even think about doing in serial.
Use iterative methods to compute an SVD. Depending on the spectral properties of your system (maybe after preconditioning) this could converge very fast and does not require storage for the matrix (which in your case has 5-10 columns each of which are the size of your input data. A good library for this is SLEPc, you only have to find a the product of the Vandermonde matrix with a vector (so you only need to store the sample locations). This is very scalable in parallel.
I believe I found the answer to my own question based on a modified version of this code. For those interested, my Java code is here.

Microsoft.DirectX.Vector3.Normalize() inconsistency

Two ways to normalize a Vector3 object; by calling Vector3.Normalize() and the other by normalizing from scratch:
class Tester {
static Vector3 NormalizeVector(Vector3 v)
{
float l = v.Length();
return new Vector3(v.X / l, v.Y / l, v.Z / l);
}
public static void Main(string[] args)
{
Vector3 v = new Vector3(0.0f, 0.0f, 7.0f);
Vector3 v2 = NormalizeVector(v);
Debug.WriteLine(v2.ToString());
v.Normalize();
Debug.WriteLine(v.ToString());
}
}
The code above produces this:
X: 0
Y: 0
Z: 1
X: 0
Y: 0
Z: 0.9999999
Why?
(Bonus points: Why Me?)
Look how they implemented it (e.g. in asm).
Maybe they wanted to be faster and produced something like:
l = 1 / v.length();
return new Vector3(v.X * l, v.Y * l, v.Z * l);
to trade 2 divisions against 3 multiplications (because they thought mults were faster than divs (which is for modern fpus most often not valid)). This introduced one level more of operation, so the less precision.
This would be the often cited "premature optimization".
Don't care about this. There's always some error involved when using floats. If you're curious, try changing to double and see if this still happens.
You should expect this when using floats, the basic reason being that the computer processes in binary and this doesn't map exactly to decimal.
For an intuitive example of issues between different bases consider the fraction 1/3. It cannot be represented exactly in Decimal (it's 0.333333.....) but can be in Terniary (as 0.1).
Generally these issues are a lot less obvious with doubles, at the expense of computing costs (double the number of bits to manipulate). However in view of the fact that a float level of precision was enough to get man to the moon then you really shouldn't obsess :-)
These issues are sort of computer theory 101 (as opposed to programming 101 - which you're obviously well beyond), and if your heading towards Direct X code where similar things can come up regularly I'd suggest it might be a good idea to pick up a basic computer theory book and read it quickly.
You have here an interesting discussion about String formatting of floats.
Just for reference:
Your number requires 24 bits to be represented, which means that you are using up the whole mantissa of a float (23bits + 1 implied bit).
Single.ToString () is ultimately implemented by a native function, so I cannot tell for sure what is going on, but my guess is that it uses the last digit to round the whole mantissa.
The reason behind this could be that you often get numbers that cannot be represented exactly in binary, so you would get a long mantissa; for instance, 0.01 is represented internally as 0.00999... as you can see by writing:
float f = 0.01f;
Console.WriteLine ("{0:G}", f);
Console.WriteLine ("{0:G}", (double) f);
by rounding at the seventh digit, you will get back "0.01", which is what you would have expected.
For what seen above, numbers with only 7 digits will not show this problem, as you already saw.
Just to be clear: the rounding is taking place only when you convert your number to a string: your calculations, if any, will use all the available bits.
Floats have a precision of 7 digits externally (9 internally), so if you go above that then rounding (with potential quirks) is automatic.
If you drop the float down to 7 digits (for instance, 1 to the left, 6 to the right) then it will work out and the string conversion will as well.
As for the bonus points:
Why you ? Because this code was 'eager to blow on you'.
(Vulcan... blow... ok.
Lamest.
Punt.
Ever)
If your code is broken by minute floating point rounding errors, then I'm afraid you need to fix it, as they're just a fact of life.

Resources