Eigen: parallelization makes no difference for conjugate gradient, and preconditioning also fails (multithreading)

This is related to this question. Today I experimented a bit with ConjugateGradient, in particular with max_iterations and tolerance. It is faster, but not fast enough. According to the documentation, it should be enough to add -fopenmp to the compilation to enable multi-threading.
I have tested using both
omp_set_num_threads(nbrThreads);
Eigen::setNbThreads(nbrThreads);
It makes no difference in time whether I use 5 threads or 1 thread, which I find a bit strange.
Secondly, another way to speed up the solution is to use preconditioning. When I try to do:
Eigen::ConjugateGradient<Eigen::SparseMatrix<float>, Eigen::Lower, Eigen::IncompleteCholesky<float>> cg;
I get the error:
void Eigen::IncompleteCholesky<Scalar, _UpLo, _OrderingType>::_solve_impl(const Rhs&, Dest&) const [with Rhs = Eigen::Matrix<float, -1, 1>; Dest = Eigen::Matrix<float, -1, 1>; Scalar = float; int _UpLo = 1; _OrderingType = Eigen::AMDOrdering<int>]: Assertion `m_factorizationIsOk && "factorize() should be called first"' failed.
Given that Eigen::SimplicialLDLT (which is a Cholesky factorization) works, shouldn't the incomplete Cholesky work as well?
EDIT:
Here is how I call cg:
Eigen::ConjugateGradient<Eigen::SparseMatrix<float>, Eigen::Lower, Eigen::IncompleteCholesky<float>> cg;
cg.setTolerance(0.01);
cg.setMaxIterations(50);
cg.analyzePattern(A_tot);
cg.compute(A_tot);
Eigen::VectorXf opt = cg.solveWithGuess(b_tot, rho_current);
Actually, when you read about the iterative solvers here, IncompleteCholesky is not listed, although IncompleteCholesky is defined here.

As explained in the documentation, you need to store the full matrix (both the upper and lower triangular parts) and pass Lower|Upper to ConjugateGradient to get multi-threading with cg.
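A minimal sketch of that setup (assuming Eigen 3.3 or later and compilation with -fopenmp; A_tot, b_tot and rho_current are the names from the question):
#include <Eigen/Sparse>
#include <Eigen/IterativeLinearSolvers>

// Row-major storage lets Eigen parallelize the sparse matrix-vector products.
typedef Eigen::SparseMatrix<float, Eigen::RowMajor> SpMatRM;

Eigen::setNbThreads(nbrThreads);
SpMatRM A_rm = A_tot; // convert to row-major (or assemble A_tot as row-major directly);
                      // it must hold the full symmetric matrix, not just one triangle
Eigen::ConjugateGradient<SpMatRM, Eigen::Lower | Eigen::Upper> cg;
cg.setTolerance(0.01);
cg.setMaxIterations(50);
cg.compute(A_rm);
Eigen::VectorXf opt = cg.solveWithGuess(b_tot, rho_current);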

The problem seems to be that I had a beta version of Eigen installed. It runs with the preconditioner using Eigen 3.3.2.

Can you use the built-in derivative functions in compute shaders? (Vulkan)

I want to use the built-in derivative functions:
vec3 dpdx = dFdx(p);
vec3 dpdy = dFdy(p);
inside a compute shader. However, I get the following error:
Message ID name: UNASSIGNED-CoreValidation-Shader-InconsistentSpirv
Message: Validation Error: [ UNASSIGNED-CoreValidation-Shader-InconsistentSpirv ] Object 0: handle = 0x5654380d4dd8, name = Logical device: GeForce GT 1030, type = VK_OBJECT_TYPE_DEVICE; | MessageID = 0x6bbb14 | SPIR-V module not valid: OpEntryPoint Entry Point <id> '5[%main]'s callgraph contains function <id> 46[%BiplanarMapping_s21_vf3_vf3_f1_], which cannot be used with the current execution modes:
Derivative instructions require DerivativeGroupQuadsNV or DerivativeGroupLinearNV execution mode for GLCompute execution model: DPdx
Derivative instructions require DerivativeGroupQuadsNV or DerivativeGroupLinearNV execution mode for GLCompute execution model: DPdy
%BiplanarMapping_s21_vf3_vf3_f1_ = OpFunction %v4float None %41
Severity: VK_DEBUG_UTILS_MESSAGE_SEVERITY_ERROR_BIT_EXT
I can't seem to find anything on the topic when I search online.
Derivative functions only work in a fragment shader. The derivatives are based on the rate of change of the value across the primitive being rendered. Obviously, compute shaders don't render primitives, so there is no rate of change to compute.
Apparently, NVIDIA has an extension that provides some derivative computation capabilities for compute shaders. That's where the weird error comes from.
Derivatives in fragment shaders are computed by taking the difference of the same value between adjacent invocations. As such, you can emulate this by using shared variables.
First, you have to make sure that the spatially adjacent invocations are in the same work group. So your work group size needs to be some multiple of 2x2 invocations. Then, you need a shared variable array, which you index by invocations within a work group. Each invocation should write its own value to its own index.
To compute the derivative, issue barrier() (together with memoryBarrierShared()) after writing the values to the shared variables. Then take the difference between one invocation's value and the value of the adjacent invocation in the same 2x2 quad. You should make sure that all invocations in the same quad get the same value, by always subtracting the value at the lower index from the value at the higher index within the quad. Something like this:
uvec2 quadBase = (gl_LocalInvocationID.xy / 2u) * 2u; // invocation at the corner of this 2x2 quad
/*type*/ derFdX = variable[quadBase.x + 1][quadBase.y + 0] - variable[quadBase.x + 0][quadBase.y + 0];
/*type*/ derFdY = variable[quadBase.x + 0][quadBase.y + 1] - variable[quadBase.x + 0][quadBase.y + 0];
The NVIDIA extension basically does this for you, though it's probably more efficient since it wouldn't need the shared variable.

Achieving a tunable parameter in OpenModelica

I have the following Modelica code
model RocketCar
Real x;
Real v;
input Real u(min = -1, max = 1);
parameter Real h;
equation
der(x) = h*v;
der(v) = h*u;
end RocketCar;
meant to model the infamous rocket car problem, which I would like to transform into an FMU.
I transform the file using OMShell:
>>> loadFile("RocketCar.mo")
true
>>> translateModelFMU(RocketCar)
"/path/to/RocketCar.fmu"
The resulting FMU lists the variable as
<ScalarVariable
name="h"
valueReference="10"
variability="fixed"
causality="parameter"
initial="exact">
<Real start="0.0"/>
</ScalarVariable>
i.e., as a fixed parameter. I would like to change the Modelica code to obtain a tunable parameter in the resulting FMU.
Note that a similar question has been asked and answered already. The suggested solution was to add annotation (Evaluate=false) to the definition of the parameter. However, this answer seems to be specific to Dymola; the suggested annotation has no effect on the resulting FMU. Is there an OpenModelica variant of the annotation to achieve the same effect?

Node: Generate a 6-digit random number using crypto.randomBytes

What is the correct way to generate a value from 0 to 999999 uniformly at random, given that 1000000 is not a power of 2?
This is my approach:
use crypto.randomBytes to generate 3 bytes and convert to hex
use the first 5 characters to convert to integer (max is fffff == 1048575 > 999999)
if the result > 999999, start from step 1 again
This effectively creates a recursive function. Is it logically correct, and will it cause any performance concerns?
There are several ways to extract random numbers in a range from random bits. Some common ones are described in NIST Special Publication 800-90A revision 1: Recommendation for Random Number Generation Using Deterministic Random Bit Generators.
Although this standard is about deterministic random bit generators, there is a helpful appendix called A.5 Converting Random Bits into a Random Number, which describes three useful methods.
The methods described are:
A.5.1 The Simple Discard Method
A.5.2 The Complex Discard Method
A.5.3 The Simple Modular Method
The first two of them are not deterministic with regard to running time, but they generate a number with no bias at all. They are based on rejection sampling.
The Complex Discard Method describes a more efficient scheme for generating large quantities of random numbers in a range. I think it is too complex for almost any normal use; I would look at the Optimized Simple Discard Method described below if you require additional efficiency instead.
The Simple Modular Method is constant-time and deterministic, but it has a non-zero (though negligible) bias. It requires a relatively large amount of additional randomness to achieve that negligible bias, though; basically, to have a bias of one in 2^128 you need 128 bits on top of the bit size of the range required. This is probably not the method to choose for smaller numbers.
Your algorithm is clearly a version of the Simple Discard Method (more generally called "rejection sampling"), so it is fine.
I've myself thought of a very efficient algorithm based on the Simple Discard Method called the "Optimized Simple Discard Method" or RNG-BC where "BC" stands for "binary compare". It is based on the observation that comparison only looks at the most significant bits, which means that the least significant bits should still be considered random and can therefore be reused. Beware that this method has not been officially peer reviewed; I do present an informal proof of equivalence with the Simple Discard Method.
Of course you should rather use a generic method that is efficient given any value of N. In that case the Complex Discard Method or Simple Modular Method should be considered over the Simple Discard Method. There are other, much more complex algorithms that are even more efficient, but generally you're fine when using either of these two.
Note that it is often beneficial to first check if N is a power of two when generating a random in the range [0, N). If N is a power of two then there is no need to use any of these possibly expensive computations; just use the bits you need from the random bit or byte generator.
It's a correct algorithm (https://en.wikipedia.org/wiki/Rejection_sampling), though you could consider using bitwise operations instead of converting to hex. It can run forever if the random number generator is malfunctioning -- you could consider trying a fixed number of times and then throwing an exception instead of looping forever.
The main possible performance problem is that on some platforms, crypto.randomBytes can block if it runs out of entropy. So you don't want to waste any randomness if you're using it.
Therefore instead of your string comparison I would use the following integer operation.
if (random_bytes < 16000000) {
return random_bytes % 1000000;
}
Because 16000000 is exactly 16 * 1000000, every value from 0 to 999999 remains equally likely. This has about a 95.4% chance of producing an answer from the first 3 bytes, and it avoids the hex-string conversion entirely.
I would suggest the following approach:
private generateCode(): string {
let code: string = "";
do {
code += randomBytes(3).readUIntBE(0, 3);
// code += Number.parseInt(randomBytes(3).toString("hex"), 16);
} while (code.length < 6);
return code.slice(0, 6);
}
This returns the numeric code as a string, but if it is necessary to get it as a number, then change it to return Number.parseInt(code.slice(0, 6)).
I call it the random_6d algorithm. Worst case, just a single additional iteration.
var random_6d = function(n2){
var n1 = crypto.randomBytes(3).readUIntLE(0, 3) >>> 4;
if(n1 < 1000000)
return n1;
if(typeof n2 === 'undefined')
return random_6d(n1);
return Math.abs(n1 - n2);
};
loop version:
var random_6d = function(){
var n1, n2;
while(true){
n1 = crypto.randomBytes(3).readUIntLE(0, 3) >>> 4;
if(n1 < 1000000)
return n1;
if(typeof n2 === 'undefined')
n2 = n1;
else
return Math.abs(n1 - n2);
};
};

Robust higher statistical moments in IDL

I'm working with noisy data in IDL, so I've been using STDDEV and robust_sigma. There are papers on robust skewness and kurtosis, for instance [1] and [2], but are there implementations, as there are for the standard deviation (in IDL or maybe C)?
The documentation of http://idlastro.gsfc.nasa.gov/ftp/pro/robust/robust_sigma.pro states:
; OPTIONAL OUTPUT KEYWORD:
; GOODVEC = Vector of non-trimmed indices of the input vector
So one calls robust_sigma with an extra parameter that keeps track of the "good indices" in the data, those used to compute the robust_sigma, as opposed to those ignored in its computation.
good_indices = lonarr(width)
robo_2 = robust_sigma(data[*], GOODVEC=good_indices)
Then use (only) those good indices to compute the other moments.
robo_3 = skewness(data[good_indices])
robo_4 = kurtosis(data[good_indices])
No need for a special implementation.

Is it possible to do an algebraic curve fit with just a single pass of the sample data?

I would like to do an algebraic curve fit of 2D data points, but for various reasons it isn't really possible to keep much of the sample data in memory at once, and iterating through all of it is an expensive process.
(The reason for this is that actually I need to fit thousands of curves simultaneously based on gigabytes of data which I'm reading off disk, and which is therefore sloooooow).
Note that the number of polynomial coefficients will be limited (perhaps 5-10), so an exact fit will be extremely unlikely, but this is ok as I'm trying to find an underlying pattern in data with a lot of random noise.
I understand how one can use a genetic algorithm to fit a curve to a dataset, but this requires many passes through the sample data, and thus isn't practical for my application.
Is there a way to fit a curve with a single pass of the data, where the state that must be maintained from sample to sample is minimal?
I should add that the nature of the data is that the points may lie anywhere on the X axis between 0.0 and 1.0, but the Y values will always be either 1.0 or 0.0.
So, in Java, I'm looking for a class with the following interface:
public interface CurveFit {
public void addData(double x, double y);
public List<Double> getBestFit(); // Returns the polynomial coefficients
}
The class that implements this must not need to keep much data in its instance fields, no more than a kilobyte even for millions of data points. This means that you can't just store the data as you get it to do multiple passes through it later.
edit: Some have suggested that finding an optimal curve in a single pass may be impossible; however, an optimal fit is not required, just as close as we can get in a single pass.
The bare bones of an approach might be to start with a curve and then modify it to bring it slightly closer to new data points as they come in - effectively a form of gradient descent. It is hoped that with sufficient data (and the data will be plentiful), we get a pretty good curve. Perhaps this will inspire someone toward a solution.
Yes, it is a projection. For
y = X beta + error
where lowercased terms are vectors, and X is a matrix, you have the solution vector
\hat{beta} = inverse(X'X) X' y
as per the OLS page. You almost never want to compute this directly but rather use LR, QR or SVD decompositions. References are plentiful in the statistics literature.
If your problem has only one parameter (and x is hence a vector as well) then this reduces to just summation of cross-products between y and x.
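For example (a minimal worked form of that single-parameter case, assuming no intercept term): hat{beta} = sum_i(x_i * y_i) / sum_i(x_i^2), where both sums can be accumulated in a single pass through the data.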
If you don't mind that you'll get a straight line "curve", then you only need six variables for any amount of data. Here's the source code that's going into my upcoming book; I'm sure that you can figure out how the DataPoint class works:
Interpolation.h:
#ifndef __INTERPOLATION_H
#define __INTERPOLATION_H
#include "DataPoint.h"
class Interpolation
{
private:
int m_count;
double m_sumX;
double m_sumXX; /* sum of X*X */
double m_sumXY; /* sum of X*Y */
double m_sumY;
double m_sumYY; /* sum of Y*Y */
public:
Interpolation();
void addData(const DataPoint& dp);
double slope() const;
double intercept() const;
double interpolate(double x) const;
double correlate() const;
};
#endif // __INTERPOLATION_H
Interpolation.cpp:
#include <cmath>
#include "Interpolation.h"
Interpolation::Interpolation()
{
m_count = 0;
m_sumX = 0.0;
m_sumXX = 0.0;
m_sumXY = 0.0;
m_sumY = 0.0;
m_sumYY = 0.0;
}
void Interpolation::addData(const DataPoint& dp)
{
m_count++;
m_sumX += dp.getX();
m_sumXX += dp.getX() * dp.getX();
m_sumXY += dp.getX() * dp.getY();
m_sumY += dp.getY();
m_sumYY += dp.getY() * dp.getY();
}
double Interpolation::slope() const
{
return (m_sumXY - (m_sumX * m_sumY / m_count)) /
(m_sumXX - (m_sumX * m_sumX / m_count));
}
double Interpolation::intercept() const
{
return (m_sumY / m_count) - slope() * (m_sumX / m_count);
}
double Interpolation::interpolate(double X) const
{
return intercept() + slope() * X;
}
double Interpolation::correlate() const
{
/* Pearson correlation coefficient computed from the running sums */
double num = m_count * m_sumXY - m_sumX * m_sumY;
double den = sqrt((m_count * m_sumXX - m_sumX * m_sumX) * (m_count * m_sumYY - m_sumY * m_sumY));
return num / den;
}
Why not use a ring buffer of some fixed size (say, the last 1000 points) and do a standard QR decomposition-based least squares fit to the buffered data? Once the buffer fills, each time you get a new point you replace the oldest and re-fit. That way you have a bounded working set that still has some data locality, without all the challenges of live stream (memoryless) processing.
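A minimal sketch of that idea, assuming Eigen is available for the QR-based solve (the class name, buffer capacity, and polynomial degree are illustrative choices):
#include <Eigen/Dense>
#include <vector>

// Keep only the most recent `capacity` points and refit a degree-`degree`
// polynomial with a QR-based least-squares solve whenever needed.
class RingBufferFit {
public:
    RingBufferFit(int capacity, int degree) : capacity_(capacity), degree_(degree) {}

    void addData(double x, double y) {
        if ((int)xs_.size() < capacity_) { xs_.push_back(x); ys_.push_back(y); }
        else { xs_[next_] = x; ys_[next_] = y; } // overwrite the oldest point
        next_ = (next_ + 1) % capacity_;
    }

    Eigen::VectorXd fit() const { // returns coefficients c0..c_degree
        const int n = (int)xs_.size();
        Eigen::MatrixXd A(n, degree_ + 1);
        Eigen::VectorXd b(n);
        for (int i = 0; i < n; ++i) { // one Vandermonde row per buffered point
            double p = 1.0;
            for (int j = 0; j <= degree_; ++j) { A(i, j) = p; p *= xs_[i]; }
            b(i) = ys_[i];
        }
        return A.colPivHouseholderQr().solve(b); // QR-based least squares
    }

private:
    int capacity_, degree_, next_ = 0;
    std::vector<double> xs_, ys_;
};
Note that a 1000-point buffer is roughly 16 KB of state, so this relaxes the strict kilobyte budget from the question in exchange for a much simpler update rule.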
Are you limiting the number of polynomial coefficients (i.e. fitting to a max power of x in your polynomial)?
If not, then you don't need a "best fit" algorithm - you can always fit N data points EXACTLY to a polynomial of N coefficients.
Just use matrices to solve N simultaneous equations for N unknowns (the N coefficients of the polynomial).
If you are limiting to a max number of coefficients, what is your max?
Following your comments and edit:
What you want is a low-pass filter to filter out noise, not fit a polynomial to the noise.
Given the nature of your data:
the points may lie anywhere on the X axis between 0.0 and 1.0, but the Y values will always be either 1.0 or 0.0.
Then you don't need even a single pass, as these two lines will pass exactly through every point:
X = [0.0 ... 1.0], Y = 0.0
X = [0.0 ... 1.0], Y = 1.0
Two short line segments, unit length, and every point falls on one line or the other.
Admittedly, an algorithm to find a good curve fit for arbitrary points in a single pass is interesting, but (based on your question), that's not what you need.
Assuming that you don't know which point should belong to which curve, something like a Hough Transform might provide what you need.
The Hough Transform is a technique that allows you to identify structure within a data set. One use is for computer vision, where it allows easy identification of lines and borders within the field of sight.
Advantages for this situation:
Each point need be considered only once
You don't need to keep a data structure for each candidate line, just one (complex, multi-dimensional) structure
Processing of each line is simple
You can stop at any point and output a set of good matches
You never discard any data, so it's not reliant on any accidental locality of references
You can trade off between accuracy and memory requirements
Isn't limited to exact matches, but will highlight partial matches too.
An approach
To find cubic fits, you'd construct a 4-dimensional Hough space, into which you'd project each of your data-points. Hotspots within Hough space would give you the parameters for the cubic through those points.
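As an illustration of the voting idea (a two-parameter stand-in for the 4-dimensional cubic case, with arbitrary bin counts and parameter ranges chosen for the sketch), here is a minimal Hough accumulator for a line y = m*x + c:
#include <cmath>
#include <vector>

class HoughLineFit {
public:
    HoughLineFit() : acc_(kBins, std::vector<int>(kBins, 0)) {}

    // Each point votes once for every (slope, intercept) cell it is consistent with.
    void addData(double x, double y) {
        for (int mi = 0; mi < kBins; ++mi) {
            double m = kMMin + (kMMax - kMMin) * mi / (kBins - 1);
            double c = y - m * x; // intercept implied by this slope
            int ci = (int)std::round((c - kCMin) / (kCMax - kCMin) * (kBins - 1));
            if (ci >= 0 && ci < kBins) ++acc_[mi][ci];
        }
    }

    // The hottest cell is the line supported by the most points so far.
    void bestFit(double& m, double& c) const {
        int bestMi = 0, bestCi = 0;
        for (int mi = 0; mi < kBins; ++mi)
            for (int ci = 0; ci < kBins; ++ci)
                if (acc_[mi][ci] > acc_[bestMi][bestCi]) { bestMi = mi; bestCi = ci; }
        m = kMMin + (kMMax - kMMin) * bestMi / (kBins - 1);
        c = kCMin + (kCMax - kCMin) * bestCi / (kBins - 1);
    }

private:
    static constexpr int kBins = 64;
    static constexpr double kMMin = -5.0, kMMax = 5.0, kCMin = -5.0, kCMax = 5.0;
    std::vector<std::vector<int>> acc_;
};
The 4-dimensional cubic version works the same way, just with an accumulator indexed by four quantized coefficients instead of two.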
You need the solution to an overdetermined linear system. The popular methods are Normal Equations (not usually recommended), QR factorization, and singular value decomposition (SVD). Wikipedia has decent explanations, Trefethen and Bau is very good. Your options:
1. Out-of-core implementation via the normal equations. This requires the product A'A, where A has many more rows than columns (so the result is very small). The matrix A is completely defined by the sample locations, so you don't have to store it; thus computing A'A is reasonably cheap (very cheap if you don't need to hit memory for the node locations). Once A'A is computed, you get the solution in one pass through your input data, but the method can be unstable. A minimal sketch of this option follows the list.
2. Implement an out-of-core QR factorization. Classical Gram-Schmidt will be fastest, but you have to be careful about stability.
3. Do it in-core with distributed memory (if you have the hardware available). Libraries like PLAPACK and ScaLAPACK can do this; the performance should be much better than option 1. The parallel scalability is not fantastic, but will be fine if it's a problem size that you would even think about doing in serial.
4. Use iterative methods to compute an SVD. Depending on the spectral properties of your system (maybe after preconditioning), this could converge very fast and does not require storage for the matrix (which in your case has 5-10 columns, each the size of your input data). A good library for this is SLEPc; you only have to apply the product of the Vandermonde matrix with a vector (so you only need to store the sample locations). This is very scalable in parallel.
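A minimal single-pass sketch of option 1 (accumulating the normal equations), assuming Eigen for the small dense solve; the class name and interface are illustrative:
#include <Eigen/Dense>

// Accumulate A'A and A'b in one pass, where each row of A is the Vandermonde
// row (1, x, x^2, ..., x^d) of one sample. Only a (d+1)x(d+1) matrix and a
// (d+1)-vector are ever kept in memory.
class StreamingPolyFit {
public:
    explicit StreamingPolyFit(int degree)
        : AtA_(Eigen::MatrixXd::Zero(degree + 1, degree + 1)),
          Atb_(Eigen::VectorXd::Zero(degree + 1)) {}

    void addData(double x, double y) {
        Eigen::VectorXd row(AtA_.rows());
        double p = 1.0;
        for (int j = 0; j < (int)row.size(); ++j) { row(j) = p; p *= x; } // Vandermonde row
        AtA_ += row * row.transpose(); // running A'A
        Atb_ += row * y;               // running A'b
    }

    // Solve the small (d+1)x(d+1) system A'A c = A'b for the coefficients.
    Eigen::VectorXd getBestFit() const {
        return AtA_.ldlt().solve(Atb_);
    }

private:
    Eigen::MatrixXd AtA_;
    Eigen::VectorXd Atb_;
};
The per-stream state is only A'A and A'b, on the order of a kilobyte for 10 coefficients, at the cost of the stability caveat mentioned in option 1.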
I believe I found the answer to my own question based on a modified version of this code. For those interested, my Java code is here.
