Multithreaded sparse matrix multiplication in Matlab

Multithreaded sparse matrix multiplication in Matlab - multithreading

I am performing several matrix multiplications of an NxN sparse (~1-2%) matrix, let's call it B, with an NxM dense matrix, let's call it A (where M < N). N is large, as is M; on the order of several thousands. I am running Matlab 2013a.
Now, usually, matrix multiplications and most other matrix operations are implicitly parallelized in Matlab, i.e. they make use of multiple threads automatically.
This appears NOT to be the case if either of the matrices are sparse (see e.g. this StackOverflow discussion - with no answer for the intended question - and this largely unanswered MathWorks thread).
This is a rather unhappy surprise for me.
We can verify that multithreading has no effects for sparse matrix operations by the following code:
clc; clear all;
N = 5000; % set matrix sizes
M = 3000;
A = randn(N,M); % create dense random matrices
B = sprand(N,N,0.015); % create sparse random matrix
Bf = full(B); %create a dense form of the otherwise sparse matrix B
for i=1:3 % test for 1, 2, and 4 threads
m(i) = 2^(i-1);
maxNumCompThreads(m(i)); % set the thread count available to Matlab
tic % starts timer
y = B*A;
walltime(i) = toc; % wall clock time
speedup(i) = walltime(1)/walltime(i);
end
% display number of threads vs. speed up relative to just a single thread
[m',speedup']
This produces the following output, which illustrates that there is no difference between using 1, 2, and 4 threads for sparse operations:
threads speedup
1.0000 1.0000
2.0000 0.9950
4.0000 1.0155
If, on the other hand, I replace B by its dense form, refered to as Bf above, I get significant speedup:
threads speedup
1.0000 1.0000
2.0000 1.8894
4.0000 3.4841
(illustrating that matrix operations for dense matrices in Matlab are indeed implicitly parallelized)
So, my question: is there any way at all to access a parallelized/threaded version of matrix operations for sparse matrices (in Matlab) without converting them to dense form?
I found one old suggestion involving .mex files at MathWorks, but it seems the links are dead and not very well documented/no feedback? Any alternatives?
It seems to be a rather severe restriction of implicit parallelism functionality, since sparse matrices are abound in computationally heavy problems, and hyperthreaded functionality highly desirable in these cases.

MATLAB already uses SuiteSparse by Tim Davis for many of its operation on sparse matrices (for example see here), but neither of which I believe are multithreaded.
Usually computations on sparse matrices are memory-bound rather than CPU-bound. So even you use a multithreaded library, I doubt you will see huge benefits in terms of performance, at least not comparable to those specialized in dense matrices...
After all that the design of sparse matrices have different goals in mind than regular dense matrices, where efficient memory storage is often more important.
I did a quick search online, and found a few implementations out there:
sparse BLAS, spBLAS, PSBLAS. For instance, Intel MKL and AMD ACML do have some support for sparse matrices
cuSPARSE, CUSP, VexCL, ViennaCL, etc.. that run on the GPU.

I ended up writing my own mex file with OpenMP for multithreading. Code as follows. Don't forget to use -largeArrayDims and /openmp (or -fopenmp) flags when compiling.
#include <omp.h>
#include "mex.h"
#include "matrix.h"
#define ll long long
void omp_smm(double* A, double*B, double* C, ll m, ll p, ll n, ll* irs, ll* jcs)
{
for (ll j=0; j<p; ++j)
{
ll istart = jcs[j];
ll iend = jcs[j+1];
#pragma omp parallel for
for (ll ii=istart; ii<iend; ++ii)
{
ll i = irs[ii];
double aa = A[ii];
for (ll k=0; k<n; ++k)
{
C[i+k*m] += B[j+k*p]*aa;
}
}
}
}
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
double *A, *B, *C; /* pointers to input & output matrices*/
size_t m,n,p; /* matrix dimensions */
A = mxGetPr(prhs[0]); /* first sparse matrix */
B = mxGetPr(prhs[1]); /* second full matrix */
mwIndex * irs = mxGetIr(prhs[0]);
mwIndex * jcs = mxGetJc(prhs[0]);
m = mxGetM(prhs[0]);
p = mxGetN(prhs[0]);
n = mxGetN(prhs[1]);
/* create output matrix C */
plhs[0] = mxCreateDoubleMatrix(m, n, mxREAL);
C = mxGetPr(plhs[0]);
omp_smm(A,B,C, m, p, n, (ll*)irs, (ll*)jcs);
}

On matlab central the same question was asked, and this answer was given:
I believe the sparse matrix code is implemented by a few specialized TMW engineers rather than an external library like BLAS/LAPACK/LINPACK/etc...
Which basically means, that you are out of luck.
However I can think of some tricks to achieve faster computations:
If you need to do several multiplications: do multiple multiplications at once and process them in parallel?
If you just want to do one multiplication: Cut the matrix into pieces (for example top half and bottom half), do the calculations of the parts in parallel and combine the results afterwards
Probably these solutions will not turn out to be as fast as properly implemented multithreading, but hopefully you can still get a speedup.

Related

matrix multiplication for complex numbers in PyTorch

I am trying to multiply two complex matrices in PyTorch and it seems the torch.matmul functions is not added yet to PyTorch library for complex numbers.
Do you have any recommendation or is there another method to multiply complex matrices in PyTorch?

Currently torch.matmul is not supported for complex tensors such as ComplexFloatTensor but you could do something as compact as the following code:
def matmul_complex(t1,t2):
return torch.view_as_complex(torch.stack((t1.real # t2.real - t1.imag # t2.imag, t1.real # t2.imag + t1.imag # t2.real),dim=2))
When possible avoid using for loops as these will result in much slower implementations.
Vectorization is achieved by using built-in methods as demonstrated in the code I have attached.
For example, your code takes roughly 6.1s on CPU while the vectorized version takes only 101ms (~60 times faster) for 2 random complex matrices with dimensions 1000 X 1000.
Update:
Since PyTorch 1.7.0 (as #EduardoReis mentioned) you can do matrix multiplication between complex matrices similarly to real-valued matrices as follows:
t1 # t2
(for t1, t2 complex matrices).

I implemented this function for pytorch.matmul for complex numbers using torch.mv and it's working fine for time-being:
def matmul_complex(t1, t2):
m = list(t1.size())[0]
n = list(t2.size())[1]
t = torch.empty((1,n), dtype=torch.cfloat)
t_total = torch.empty((m,n), dtype=torch.cfloat)
for i in range(0,n):
if i == 0:
t_total = torch.mv(t1,t2[:,i])
else:
t_total = torch.cat((t_total, torch.mv(t1,t2[:,i])), 0)
t_final = torch.reshape(t_total, (m,n))
return t_final
I am new to PyTorch, so please correct me if I am wrong.

Element-wise variance of an iterator

What's a numerically-stable way of taking the variance of an iterator elementwise? As an example, I would like to do something like
var((rand(4,2) for i in 1:10))
and get back a (4,2) matrix which is the variance in each coefficient. This throws an error using Julia's Base var. Is there a package that can handle this? Or an easy (and storage-efficient) way to do this using the Base Julia function? Or does one need to be developed on its own?

I went ahead and implemented a Welford algorithm to calculate this:
# Welford algorithm
# https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance
function componentwise_meanvar(A;bessel=true)
x0 = first(A)
n = 0
mean = zero(x0)
M2 = zero(x0)
delta = zero(x0)
delta2 = zero(x0)
for x in A
n += 1
delta .= x .- mean
mean .+= delta./n
delta2 .= x .- mean
M2 .+= delta.*delta2
end
if n < 2
return NaN
else
if bessel
M2 .= M2 ./ (n .- 1)
else
M2 .= M2 ./ n
end
return mean,M2
end
end
A few other algorithms are implemented in DiffEqMonteCarlo.jl as well. I'm surprised I couldn't find a library for this, but maybe will refactor this out someday.

See update below for a numerically stable version
Another method to calculate this:
srand(0) # reset random for comparing across implementations
moment2var(t) = (t[3]-t[2].^2./t[1])./(t[1]-1)
foldfunc(x,y) = (x[1]+1,x[2].+y,x[3].+y.^2)
moment2var(foldl(foldfunc,(0,zeros(1,1),zeros(1,1)),(rand(4,2) for i=1:10)))
Gives:
4×2 Array{Float64,2}:
0.0848123 0.0643537
0.0715945 0.0900416
0.111934 0.084314
0.0819135 0.0632765
Similar to:
srand(0) # reset random for comparing across implementations
# naive component-wise application of `var` function
map(var,zip((rand(4,2) for i=1:10)...))
which is the non-iterator version (or offline version in CS terminology).
This method is based on calculation of variance from mean and sum-of-squares. moment2var and foldfunc are just a helper functions, but it fits in one-line without them.
Comments:
Speedwise, this should be pretty good as well. Perhaps, StaticArrays and initializing the foldl's v0 with the correct eltype of the iterator would save even more time.
Benchmarking gave 5x speed advantage (and better memory usage) over componentwise_meanvar (from another answer) on a sample input.
Using moment2meanvar(t)=(t[2]./t[1],(t[3]-t[2].^2./t[1])./(t[1]-1)‌) gives both mean and variance like componentwise_meanvar.
As #ChrisRackauckas noted, this method suffers from numerical instability when number of elements to sum is large.
--- UPDATE with variant of method ---
A little abstraction of the question asks for a way to do a foldl (and reduce,foldr) on an iterator returning a matrix, element-wise and retaining shape. To do so, we can define an assisting function mfold which takes a folding-function and makes it fold matrices element-wise. Define it as follows:
mfold(f) = (x,y)->[f(t[1],t[2]) for t in zip(x,y)]
For this specific problem of variance, we can define the component-wise fold functions, and a final function to combine the moments into the variance (and mean if wanted). The code:
ff(x,y) = (x[1]+1,x[2]+y,x[3]+y^2) # fold and collect moments
moment2var(t) = (t[3]-t[2]^2/t[1])/(t[1]-1) # calc variance from moments
moment2meanvar(t) = (t[2]./t[1],(t[3]-t[2].^2./t[1])./(t[1]-1))
We can see moment2meanvar works on a single vector as follows:
julia> moment2meanvar(foldl(ff,(0.0,0.0,0.0),[1.0,2.0,3.0]))
(2.0, 1.0)
Now to matrix-ize it using foldm (using .-notation):
moment2var.(foldl(mfold(ff),fill((0,0,0),(4,2)),(rand(4,2) for i=1:10)))
#ChrisRackauckas noted this is not numerically stable, and another method (detailed in Wikipedia) is better. Using foldm this could be implemented as:
# better fold function compensating the sums for stability
ff2(x,y) = begin
delta=y-x[2]
mean=x[2]+delta/(x[1]+1)
return (x[1]+1,mean,x[3]+delta*(y-mean))
end
# combine the collected information for the variance (and mean)
m2var(t) = t[3]/(t[1]-1)
m2meanvar(t) = (t[2],t[3]/(t[1]-1))
Again we have:
m2var.(foldl(mfold(ff2),fill((0,0.0,0.0),(4,2)),(rand(4,2) for i=1:10)))
Giving the same results (perhaps a little more accurately).

Or an easy (and storage-efficient) way to do this using the Base Julia function?
Out of curiosity, why is the standard solution of using var along the external dimension not good for you?
julia> var(cat(3,(rand(4,2) for i in 1:10)...),3)
4×2×1 Array{Float64,3}:
[:, :, 1] =
0.08847 0.104799
0.0946243 0.0879721
0.105404 0.0617594
0.0762611 0.091195
Obviously, I'm using cat here, which clearly is not very storage efficient, just so I can use the Base Julia function and your original generator syntax as per your question. But you could make this storage efficient as well, if you initialise your random values directly on a preallocated array of size (4,2,10), so that's not really an issue here.
Or did I misunderstand your question?
EDIT - benchmark in response to comments
function standard_var(Y, A)
for i in 1 : length(A)
Y[:,:,i], = next(A,i);
end
var(Y,3)
end
function testit()
A = (rand(4,2) for i in 1:10000);
Y = Array{Float64, 3}(4,2,length(A));
#time componentwise_meanvar(A); # as defined in Chris's answer above
#time standard_var(Y, A) # standard variance + using preallocation
#time var(cat(3, A...), 3); # standard variance without preallocation
return nothing
end
julia> testit()
0.004258 seconds (10.01 k allocations: 1.374 MiB)
0.006368 seconds (49.51 k allocations: 2.129 MiB)
5.954470 seconds (50.19 M allocations: 2.989 GiB, 71.32% gc time)

Parallelization of Piecewise Polynomial Evaluation

I am trying to evaluate points in a large piecewise polynomial, which is obtained from a cubic-spline. This takes a long time to do and I would like to speed it up.
As such, I would like to evaluate a points on a piecewise polynomial with parallel processes, rather than sequentially.
Code:
z = zeros(1e6, 1) ; % preallocate some memory for speed
Y = rand(11220,161) ; %some data, rand for generating a working example
X = 0 : 0.0125 : 2 ; % vector of data sites
pp = spline(X, Y) ; % get the piecewise polynomial form of the cubic spline.
The resulting structure is large.
for t = 1 : 1e6 % big number
hcurrent = ppval(pp,t); %evaluate the piecewise polynomial at t
z(t) = sum(x(t:t+M-1).*hcurrent,1) ; % do some operation of the interpolated value. Most likely not relevant to this question.
end
Unfortunately, with matrix form and using:
hcurrent = flipud(ppval(pp, 1: 1e6 ))
requires too much memory to process, so cannot be done. Is there a way that I can batch process this code to speed it up?

For scalar second arguments, as in your example, you're dealing with two issues. First, there's a good amount of function call overhead and redundant computation (e.g., unmkpp(pp) is called every loop iteration). Second, ppval is written to be general so it's not fully vectorized and does a lot of things that aren't necessary in your case.
Below is vectorized code code that take advantage of some of the structure of your problem (e.g., t is an integer greater than 0), avoids function call overhead, move some calculations outside of your main for loop (at the cost of a bit of extra memory), and gets rid of a for loop inside of ppval:
n = 1e6;
z = zeros(n,1);
X = 0:0.0125:2;
Y = rand(11220,numel(X));
pp = spline(X,Y);
[b,c,l,k,dd] = unmkpp(pp);
T = 1:n;
idx = discretize(T,[-Inf b(2:l) Inf]); % Or: [~,idx] = histc(T,[-Inf b(2:l) Inf]);
x = bsxfun(#power,T-b(idx),(k-1:-1:0).').';
idx = dd*idx;
d = 1-dd:0;
for t = T
hcurrent = sum(bsxfun(#times,c(idx(t)+d,:),x(t,:)),2);
z(t) = ...;
end
The resultant code takes ~34% of the time of your example for n=1e6. Note that because of the vectorization, calculations are performed in a different order. This will result in slight differences between outputs from ppval and my optimized version due to the nature of floating point math. Any differences should be on the order of a few times eps(hcurrent). You can still try using parfor to further speed up the calculation (with four already running workers, my system took just 12% of your code's original time).
I consider the above a proof of concept. I may have over-optmized the code above if your example doesn't correspond well to your actual code and data. In that case, I suggest creating your own optimized version. You can start by looking at the code for ppval by typing edit ppval in your Command Window. You may be able to implement further optimizations by looking at the structure of your problem and what you ultimately want in your z vector.
Internally, ppval still uses histc, which has been deprecated. My code above uses discretize to perform the same task, as suggested by the documentation.

Use parfor command for parallel loops. see here, also precompute z vector as z(j) = x(j:j+M-1) and hcurrent in parfor for speed up.

The Spline Parameters estimation can be written in Matrix form.
Once you write it in Matrix form and solve it you can use the Model Matrix to evaluate the Spline on all data point using Matrix Multiplication which is probably the most tuned operation in MATLAB.

Cython: make prange parallelization thread-safe

Cython starter here. I am trying to speed up a calculation of a certain pairwise statistic (in several bins) by using multiple threads. In particular, I am using prange from cython.parallel, which internally uses openMP.
The following minimal example illustrates the problem (compilation via Jupyter notebook Cython magic).
Notebook setup:
%load_ext Cython
import numpy as np
Cython code:
%%cython --compile-args=-fopenmp --link-args=-fopenmp -a
from cython cimport boundscheck
import numpy as np
from cython.parallel cimport prange, parallel
#boundscheck(False)
def my_parallel_statistic(double[:] X, double[:,::1] bins, int num_threads):
cdef:
int N = X.shape[0]
int nbins = bins.shape[0]
double Xij,Yij
double[:] Z = np.zeros(nbins,dtype=np.float64)
int i,j,b
with nogil, parallel(num_threads=num_threads):
for i in prange(N,schedule='static',chunksize=1):
for j in range(i):
#some pairwise quantities
Xij = X[i]-X[j]
Yij = 0.5*(X[i]+X[j])
#check if in bin
for b in range(nbins):
if (Xij < bins[b,0]) or (Xij > bins[b,1]):
continue
Z[b] += Xij*Yij
return np.asarray(Z)
mock data and bins
X = np.random.rand(10000)
bin_edges = np.linspace(0.,1,11)
bins = np.array([bin_edges[:-1],bin_edges[1:]]).T
bins = bins.copy(order='C')
Timing via
%timeit my_parallel_statistic(X,bins,1)
%timeit my_parallel_statistic(X,bins,4)
yields
1 loop, best of 3: 728 ms per loop
1 loop, best of 3: 330 ms per loop
which is not a perfect scaling, but that is not the main point of the question. (But do let me know if you have suggestions beyond adding the usual decorators or fine-tuning the prange arguments.)
However, this calculation is apparently not thread-safe:
Z1 = my_parallel_statistic(X,bins,1)
Z4 = my_parallel_statistic(X,bins,4)
np.allclose(Z1,Z4)
reveals a significant difference between the two results (up to 20% in this example).
I strongly suspect that the problem is that multiple threads can do
Z[b] += Xij*Yij
at the same time. But what I don't know is how to fix this without sacrificing the speed-up.
In my actual use case, the calculation of Xij and Yij is more expensive, hence I would like to do them only once per pair. Also, pre-computing and storing Xij and Yij for all pairs and then simply looping through bins is not a good option either because N can get very large, and I can't store 100,000 x 100,000 numpy arrays in memory (this was actually the main motivation for rewriting it in Cython!).
System info (added following suggestion in comments):
CPU(s): 8
Model name: Intel(R) Core(TM) i7-4790K CPU # 4.00GHz
OS: Red Hat Linux v6.8
Memory: 16 GB

Yes, Z[b] += Xij*Yij is indeed a race condition.
There are a couple of options of making this atomic or critical. Implementation issues with Cython aside, you would in any case have bad performance due to false sharing on the shared Z vector.
So the better alternative is to reserve a private array for each thread. There are a couple of (non-)options again. One could use a private malloc'd pointer, but I wanted to stick with np. Memory slices cannot be assigned as private variables. A two dimensional (num_threads, nbins) array works, but for some reason generates very complicated inefficient array index code. This works but is slower and does not scale.
A flat numpy array with manual "2D" indexing works well. You get a little bit extra performance by avoiding padding the private parts of the array to 64 byte, which is a typical cache line size. This avoids false sharing between the cores. The private parts are simply summed up serially outside of the parallel region.
%%cython --compile-args=-fopenmp --link-args=-fopenmp -a
from cython cimport boundscheck
import numpy as np
from cython.parallel cimport prange, parallel
cimport openmp
#boundscheck(False)
def my_parallel_statistic(double[:] X, double[:,::1] bins, int num_threads):
cdef:
int N = X.shape[0]
int nbins = bins.shape[0]
double Xij,Yij
# pad local data to 64 byte avoid false sharing of cache-lines
int nbins_padded = (((nbins - 1) // 8) + 1) * 8
double[:] Z_local = np.zeros(nbins_padded * num_threads,dtype=np.float64)
double[:] Z = np.zeros(nbins)
int i,j,b, bb, tid
with nogil, parallel(num_threads=num_threads):
tid = openmp.omp_get_thread_num()
for i in prange(N,schedule='static',chunksize=1):
for j in range(i):
#some pairwise quantities
Xij = X[i]-X[j]
Yij = 0.5*(X[i]+X[j])
#check if in bin
for b in range(nbins):
if (Xij < bins[b,0]) or (Xij > bins[b,1]):
continue
Z_local[tid * nbins_padded + b] += Xij*Yij
for tid in range(num_threads):
for bb in range(nbins):
Z[bb] += Z_local[tid * nbins_padded + bb]
return np.asarray(Z)
This performs quite well on my 4 core machine, with 720 ms / 191 ms, a speedup of 3.6. The remaining gap may be due to turbo mode. I don't have access to a proper machine for testing right now.

You are right that the access to Z is under a race condition.
You might be better off defining num_threads copies of Z, as cdef double[:] Z = np.zeros((num_threads, nbins), dtype=np.float64) and perform a sum along axis 0 after the prange loop.
return np.sum(Z, axis=0)
Cython code can have a with gil statement in a parallel region but it is only documented for error handling. You could have a look at the general C code to see whether that would trigger an atomic OpenMP operation but I doubt it.

Is it possible to do an algebraic curve fit with just a single pass of the sample data?

I would like to do an algebraic curve fit of 2D data points, but for various reasons - it isn't really possible to have much of the sample data in memory at once, and iterating through all of it is an expensive process.
(The reason for this is that actually I need to fit thousands of curves simultaneously based on gigabytes of data which I'm reading off disk, and which is therefore sloooooow).
Note that the number of polynomial coefficients will be limited (perhaps 5-10), so an exact fit will be extremely unlikely, but this is ok as I'm trying to find an underlying pattern in data with a lot of random noise.
I understand how one can use a genetic algorithm to fit a curve to a dataset, but this requires many passes through the sample data, and thus isn't practical for my application.
Is there a way to fit a curve with a single pass of the data, where the state that must be maintained from sample to sample is minimal?
I should add that the nature of the data is that the points may lie anywhere on the X axis between 0.0 and 1.0, but the Y values will always be either 1.0 or 0.0.
So, in Java, I'm looking for a class with the following interface:
public interface CurveFit {
public void addData(double x, double y);
public List<Double> getBestFit(); // Returns the polynomial coefficients
}
The class that implements this must not need to keep much data in its instance fields, no more than a kilobyte even for millions of data points. This means that you can't just store the data as you get it to do multiple passes through it later.
edit: Some have suggested that finding an optimal curve in a single pass may be impossible, however an optimal fit is not required, just as close as we can get it in a single pass.
The bare bones of an approach might be if we have a way to start with a curve, and then a way to modify it to get it slightly closer to new data points as they come in - effectively a form of gradient descent. It is hoped that with sufficient data (and the data will be plentiful), we get a pretty good curve. Perhaps this inspires someone to a solution.

Yes, it is a projection. For
y = X beta + error
where lowercased terms are vectors, and X is a matrix, you have the solution vector
\hat{beta} = inverse(X'X) X' y
as per the OLS page. You almost never want to compute this directly but rather use LR, QR or SVD decompositions. References are plentiful in the statistics literature.
If your problem has only one parameter (and x is hence a vector as well) then this reduces to just summation of cross-products between y and x.

If you don't mind that you'll get a straight line "curve", then you only need six variables for any amount of data. Here's the source code that's going into my upcoming book; I'm sure that you can figure out how the DataPoint class works:
Interpolation.h:
#ifndef __INTERPOLATION_H
#define __INTERPOLATION_H
#include "DataPoint.h"
class Interpolation
{
private:
int m_count;
double m_sumX;
double m_sumXX; /* sum of X*X */
double m_sumXY; /* sum of X*Y */
double m_sumY;
double m_sumYY; /* sum of Y*Y */
public:
Interpolation();
void addData(const DataPoint& dp);
double slope() const;
double intercept() const;
double interpolate(double x) const;
double correlate() const;
};
#endif // __INTERPOLATION_H
Interpolation.cpp:
#include <cmath>
#include "Interpolation.h"
Interpolation::Interpolation()
{
m_count = 0;
m_sumX = 0.0;
m_sumXX = 0.0;
m_sumXY = 0.0;
m_sumY = 0.0;
m_sumYY = 0.0;
}
void Interpolation::addData(const DataPoint& dp)
{
m_count++;
m_sumX += dp.getX();
m_sumXX += dp.getX() * dp.getX();
m_sumXY += dp.getX() * dp.getY();
m_sumY += dp.getY();
m_sumYY += dp.getY() * dp.getY();
}
double Interpolation::slope() const
{
return (m_sumXY - (m_sumX * m_sumY / m_count)) /
(m_sumXX - (m_sumX * m_sumX / m_count));
}
double Interpolation::intercept() const
{
return (m_sumY / m_count) - slope() * (m_sumX / m_count);
}
double Interpolation::interpolate(double X) const
{
return intercept() + slope() * X;
}
double Interpolation::correlate() const
{
return m_sumXY / sqrt(m_sumXX * m_sumYY);
}

Why not use a ring buffer of some fixed size (say, the last 1000 points) and do a standard QR decomposition-based least squares fit to the buffered data? Once the buffer fills, each time you get a new point you replace the oldest and re-fit. That way you have a bounded working set that still has some data locality, without all the challenges of live stream (memoryless) processing.

Are you limiting the number of polynomial coefficients (i.e. fitting to a max power of x in your polynomial)?
If not, then you don't need a "best fit" algorithm - you can always fit N data points EXACTLY to a polynomial of N coefficients.
Just use matrices to solve N simultaneous equations for N unknowns (the N coefficients of the polynomial).
If you are limiting to a max number of coefficients, what is your max?
Following your comments and edit:
What you want is a low-pass filter to filter out noise, not fit a polynomial to the noise.

Given the nature of your data:
the points may lie anywhere on the X axis between 0.0 and 1.0, but the Y values will always be either 1.0 or 0.0.
Then you don't need even a single pass, as these two lines will pass exactly through every point:
X = [0.0 ... 1.0], Y = 0.0
X = [0.0 ... 1.0], Y = 1.0
Two short line segments, unit length, and every point falls on one line or the other.
Admittedly, an algorithm to find a good curve fit for arbitrary points in a single pass is interesting, but (based on your question), that's not what you need.

Assuming that you don't know which point should belong to which curve, something like a Hough Transform might provide what you need.
The Hough Transform is a technique that allows you to identify structure within a data set. One use is for computer vision, where it allows easy identification of lines and borders within the field of sight.
Advantages for this situation:
Each point need be considered only once
You don't need to keep a data structure for each candidate line, just one (complex, multi-dimensional) structure
Processing of each line is simple
You can stop at any point and output a set of good matches
You never discard any data, so it's not reliant on any accidental locality of references
You can trade off between accuracy and memory requirements
Isn't limited to exact matches, but will highlight partial matches too.
An approach
To find cubic fits, you'd construct a 4-dimensional Hough space, into which you'd project each of your data-points. Hotspots within Hough space would give you the parameters for the cubic through those points.

You need the solution to an overdetermined linear system. The popular methods are Normal Equations (not usually recommended), QR factorization, and singular value decomposition (SVD). Wikipedia has decent explanations, Trefethen and Bau is very good. Your options:
Out-of-core implementation via the normal equations. This requires the product A'A where A has many more rows than columns (so the result is very small). The matrix A is completely defined by the sample locations so you don't have to store it, thus computing A'A is reasonably cheap (very cheap if you don't need to hit memory for the node locations). Once A'A is computed, you get the solution in one pass through your input data, but the method can be unstable.
Implement an out-of-core QR factorization. Classical Gram-Schmidt will be fastest, but you have to be careful about stability.
Do it in-core with distributed memory (if you have the hardware available). Libraries like PLAPACK and SCALAPACK can do this, the performance should be much better than 1. The parallel scalability is not fantastic, but will be fine if it's a problem size that you would even think about doing in serial.
Use iterative methods to compute an SVD. Depending on the spectral properties of your system (maybe after preconditioning) this could converge very fast and does not require storage for the matrix (which in your case has 5-10 columns each of which are the size of your input data. A good library for this is SLEPc, you only have to find a the product of the Vandermonde matrix with a vector (so you only need to store the sample locations). This is very scalable in parallel.

I believe I found the answer to my own question based on a modified version of this code. For those interested, my Java code is here.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string