How to compute the difference between two very long strings in C++?

I would like to solve the Levenshtein distance problem for strings that are very long.
Edit 2:
As Bobah pointed out, the title was misleading, so I have updated the title of the question.
The initial title was "How to declare a 100000x100000 2-D integer array in C++?"
The content was:
Is there any way to declare int x[100000][100000] in C++?
When I declare it globally, the compiler produces error: size of array ‘x’ is too large.
One method could be using map< pair< int , int > , int > mymap.
But allocating and deallocating takes more time. Is there any other way, like using vector<int> myvec?

For memory blocks that large, the best approach is dynamic allocation using the operating system's facilities for adding virtual memory to the process.
However, look how large a block you are trying to allocate:
40 000 000 000 bytes
I take my previous advice back. For a block that large, the best approach is to analyze the problem and figure out a way to use less memory.

The edit distance matrix can be filled one row at a time. Remembering the previous row is enough to compute the current row. This observation reduces space usage from quadratic to linear. Makes sense?
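The row-at-a-time idea can be sketched as follows. This is a minimal illustration, not code from the answer; the function name levenshtein and the choice to put the shorter string along the row are assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Levenshtein distance using two rows instead of the full matrix.
// Memory is O(min(|a|, |b|)); time is still O(|a| * |b|).
std::size_t levenshtein(const std::string& a, const std::string& b) {
    // Put the shorter string along the row so both rows stay as small as possible.
    const std::string& s = a.size() < b.size() ? b : a;  // longer string, one row per character
    const std::string& t = a.size() < b.size() ? a : b;  // shorter string, one column per character

    std::vector<std::size_t> prev(t.size() + 1);
    std::vector<std::size_t> curr(t.size() + 1);

    // Row 0: turning the empty prefix of s into t[0..j) takes j insertions.
    for (std::size_t j = 0; j <= t.size(); ++j) {
        prev[j] = j;
    }

    for (std::size_t i = 1; i <= s.size(); ++i) {
        curr[0] = i;  // turning s[0..i) into the empty prefix takes i deletions
        for (std::size_t j = 1; j <= t.size(); ++j) {
            const std::size_t cost = (s[i - 1] == t[j - 1]) ? 0 : 1;
            curr[j] = std::min({prev[j] + 1,          // deletion
                                curr[j - 1] + 1,      // insertion
                                prev[j - 1] + cost}); // substitution or match
        }
        std::swap(prev, curr);  // the current row becomes the previous row
    }

    // Sanity check: the distance can never exceed the longer length.
    assert(prev[t.size()] <= s.size());
    return prev[t.size()];
}
```

For example, levenshtein("kitten", "sitting") returns 3. For two strings of length 100000 this needs only two rows of 100001 entries each, although the running time remains proportional to the product of the lengths.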

Your question is very interesting, but the title is misleading.
This is what you need in terms of data model (x - first string, y - second string, * - distance matrix).
y <-- first string (scrolls from top down)
y
x x x x x x x x <- second string (scrolls from left to right)
y * * *
y * * *
y * * * <-- distance matrix (a donut) scrolls together with strings
and grows/shrinks when needed, as explained below
y
Have two relatively long (but still << N) character buffers and a relatively small (<< buffer size) rectangular (starting as a square) distance matrix.
Make the matrix a donut: a two-dimensional ring buffer (you can use the one from Boost, or just std::deque).
When the string fragments currently covered by the matrix are a 100% match, shift both buffers by one and rotate the donut around both axes, recalculating one new row/column in the distance matrix.
When the match is below 100% and below the configured threshold, grow both dimensions of the donut without dropping any rows/columns, and keep doing so until either the match rises above the threshold or you reach the maximum donut size. When the match ratio reaches the threshold from below, scroll the donut, discarding the heads of the x and y buffers and aligning them at the same time (only X needs to move by 1 when the distance matrix shows that X[i] does not exist in Y, but X[i+1,i+m] matches Y[j, j+m-1]).
As a result you will have a simple yet very efficient heuristic diff engine with a deterministic, limited memory footprint. All memory can be pre-allocated at startup, so no dynamic allocation will slow it down at runtime.
Apache v2 license, in case you decide to go for it.

Related

Splitting quadratic Bezier curve into several unequal parts

I am dealing with clipping of quadratic Bézier curves. Clipping is a standard graphics task. Typically, no matter what we display on a screen, we only want to render the part that fits into the screen bounds, as an optimization.
For straight lines, there is the Cohen-Sutherland algorithm, and a slightly extended version of it, the Sutherland–Hodgman algorithm; the first deals with lines and the second with polygons.
Essentially, the algorithms split the computer screen into tic-tac-toe-like squares, where the central square is what fits on the screen, and we special-case each of left/right and above/below. Then, when one end of the line is off the right edge of the screen and the other is not, we replace the x coordinate of this point with the screen's max value of x, and calculate the y value for it. This becomes the new endpoint of the clipped line. Pretty simple, and it works well.
With Bézier curves, the same approach can be taken, only in addition to the ends, we need to consider control points. In the case of a quadratic curve, there is only one control point.
To clip the curve, we can do something very similar to Cohen-Sutherland. Only, depending on the situation, we might need to cut the original curve into up to five (5) pieces. Just like both ends of a straight line might be offscreen while the center is visible, the same situation needs to be handled with curves, yet here we only need to deal with the hull [height] of the curve causing a mid-section to be invisible. Therefore, we might end up with two new curves after the clipping.
Finding one of the coordinates for these curves is pretty easy. It is still the min/max coordinate for one of the axes, and the value of the other coordinate. There is prior art for this; for example, "calculate x for y" is a good starting point. We want to adapt the formula so vectors turn into separate x and y coordinates, but the rest is doable.
Next, however, we still have an unsolved problem: these one or two new curves are completely new quadratic curves, and each will therefore have a new control point.
There is a thread, "split quadratic curve into two", where the author seems to be doing roughly what I need, albeit in a slightly different way. There is an accepted answer, yet I could not get the results to work.
I want to end up with a function like:
function clipSegment(sx, sy, cx, cy, ex, ey, bounds) {
let curves: {sx, sy, cx, cy, ex, ey}[] = [];
...
return curves;
}
It should take coordinates and a bounds object that has both min and max for both the x and y coordinates. I think that the Cohen-Sutherland approach with squares and bit-codes should work here just as well. We get more cases for curves, but everything is doable. My problem is the new control point coordinates. For example, we could calculate t from one of the coordinates, doing something like:
function getQuadraticPoint(t, sx, sy, cp1x, cp1y, ex, ey) {
const x = (1 - t) * (1 - t) * sx + 2 * (1 - t) * t * cp1x + t * t * ex;
const y = (1 - t) * (1 - t) * sy + 2 * (1 - t) * t * cp1y + t * t * ey;
return { x, y };
}
Once we have the new start and/or end, how do we get the new control points?
Some developers I found online, working on similar problems, recommended just working with t and changing the interval of t from [0, 1] to [0, t]. This, however, won't work easily with the Canvas 2D API. The 2D path needs the control point and the end point [after the pen moves to the beginning with moveTo].
I believe that the quadratic Bézier case should have a closed-form solution. Yet, I have not figured out what it is. Any ideas?
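One closed form that fits the question is the blossom (polar form) of the quadratic: the piece of the curve over a parameter interval [t0, t1] is itself a quadratic Bézier curve whose start, control and end points are B(t0, t0), B(t0, t1) and B(t1, t1). The sketch below is written in C++ rather than JavaScript, and the names Point, Quad, blossom and sub_curve are hypothetical, chosen only for this illustration.

```cpp
#include <cassert>

struct Point { double x, y; };
struct Quad { Point start, control, end; };

// Blossom of a quadratic Bezier curve:
// B(u, v) = (1-u)(1-v) P0 + ((1-u)v + u(1-v)) P1 + uv P2.
// B(t, t) is the ordinary point on the curve at parameter t.
Point blossom(const Quad& q, double u, double v) {
    const double w0 = (1.0 - u) * (1.0 - v);
    const double w1 = (1.0 - u) * v + u * (1.0 - v);
    const double w2 = u * v;
    return { w0 * q.start.x + w1 * q.control.x + w2 * q.end.x,
             w0 * q.start.y + w1 * q.control.y + w2 * q.end.y };
}

// Returns the part of q between parameters t0 and t1 as a new quadratic curve.
Quad sub_curve(const Quad& q, double t0, double t1) {
    assert(0.0 <= t0 && t0 <= t1 && t1 <= 1.0);
    return { blossom(q, t0, t0),    // new start point
             blossom(q, t0, t1),    // new control point
             blossom(q, t1, t1) };  // new end point
}
```

With the t values found at the clip boundaries, each visible piece can then be drawn with moveTo(start) followed by quadraticCurveTo(control, end). For the curve (0,0), (1,2), (2,0), sub_curve over [0, 0.5] yields start (0,0), control (0.5,1) and end (1,1).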

J Primes Enumeration

J will answer the n-th prime via p:n.
If I ask for the 100 millionth prime I get an almost instant answer. I cannot imagine J is sieving for that prime so quickly, nor looking it up in a table, as that table would be around 1GB in size.
There are equations giving approximations to the number of primes up to a bound, but they are only approximations.
How is J finding the answer so quickly?
J uses a table to start, then calculates
NOTE! This is speculation, based on benchmarks (shown below).
If you want to quickly try for yourself, try the following:
p:1e8 NB. near-instant
p:1e8-1 NB. noticeable pause
The low points on the graph are where J looks up the prime in a table. After that, J is calculating the value from a particular starting point so it doesn't have to calculate the entire thing. So some lucky primes will be constant time (simple table lookup) but generally there's first a table lookup, and then a calculation. But happily, it calculates starting from the previous table lookup instead of calculating the entire value.
Benchmarks
I did some benchmarking to see how p: performs on my machine (iMac i5, 16G RAM). I'm using J803. The results are interesting. I'm guessing the sawtooth pattern in the time plots (visible on the 'up to 2e5' plot) is lookup table related, while the overall log-ish shape (visible on the 'up to 1e7' plot) is CPU related.
NB. my test script
ts=:3 : 0
a=.y
while. a do.
c=.timespacex 'p:(1e4*a)' NB. 1000 times a
a=.<:a
b=.c;b
end.
}:b
)
a =: ts 200
require'plot'
plot >0{each a NB. time
plot >1{each a NB. space
[Plots for p: up to 2e5: time, space]
[Plots for p: up to 1e7: time, space]
During these runs one core was hovering around 100%.
Also, the voc page states:
Currently, arguments larger than 2^31 are tested to be prime according to a probabilistic algorithm (Miller-Rabin).
And in addition to a prime lookup table, as @Mauris points out, v2.c contains this function:
static F1(jtdetmr){A z;B*zv;I d,h,i,n,wn,*wv;
RZ(w=vi(w));
wn=AN(w); wv=AV(w);
GA(z,B01,wn,AR(w),AS(w)); zv=BAV(z);
for(i=0;i<wn;++i){
n=*wv++;
if(1>=n||!(1&n)||0==n%3||0==n%5){*zv++=0; continue;}
h=0; d=n-1; while(!(1&d)){++h; d>>=1;}
if (n< 9080191)*zv++=spspd(31,n,d,h)&&spspd(73,n,d,h);
else if(n<94906266)*zv++=spspd(2 ,n,d,h)&&spspd( 7,n,d,h)&&spspd(61,n,d,h);
else *zv++=spspx(2 ,n,d,h)&&spspx( 7,n,d,h)&&spspx(61,n,d,h);
}
RE(0); R z;
} /* deterministic Miller-Rabin */
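For readers unfamiliar with the technique, here is a stand-alone sketch of a deterministic Miller-Rabin test for 32-bit inputs. It is not J's implementation: the function names are hypothetical, and it uses the witness set {2, 7, 61} for the whole range, which is known to be sufficient for n < 4,759,123,141 (the J source above additionally uses the pair {31, 73} for n < 9,080,191).

```cpp
#include <cassert>
#include <cstdint>
#include <initializer_list>

// Computes (base^exp) mod m. m is at most 2^32, so every product fits in 64 bits.
std::uint64_t pow_mod(std::uint64_t base, std::uint64_t exp, std::uint64_t m) {
    assert(m != 0);
    std::uint64_t result = 1 % m;
    base %= m;
    while (exp > 0) {
        if (exp & 1) result = result * base % m;
        base = base * base % m;
        exp >>= 1;
    }
    return result;
}

// Strong probable-prime test of odd n to base a, where n - 1 = d * 2^h with d odd.
bool strong_probable_prime(std::uint64_t a, std::uint64_t n, std::uint64_t d, int h) {
    a %= n;
    if (a == 0) return true;  // the base is a multiple of n, so it carries no information
    std::uint64_t x = pow_mod(a, d, n);
    if (x == 1 || x == n - 1) return true;
    for (int r = 1; r < h; ++r) {
        x = x * x % n;
        if (x == n - 1) return true;
    }
    return false;
}

// Deterministic for every 32-bit n because {2, 7, 61} has no strong
// pseudoprime in common below 4,759,123,141.
bool is_prime_u32(std::uint32_t n) {
    if (n < 2) return false;
    for (std::uint32_t p : {2u, 3u, 5u, 7u}) {
        if (n % p == 0) return n == p;
    }
    std::uint64_t d = n - 1;
    int h = 0;
    while ((d & 1) == 0) { d >>= 1; ++h; }
    for (std::uint64_t a : {2ull, 7ull, 61ull}) {
        if (!strong_probable_prime(a, n, d, h)) return false;
    }
    return true;
}
```

A test like this makes it cheap to step from a tabulated prime to a nearby one by checking candidates in turn, which would be consistent with the speculation above, though the source does not confirm that this is exactly how p: proceeds.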

Is there a graph-drawing tool that will allow me to constrain x, and automatically lay out y?

I am looking for a tool similar to graphviz that can render graphs, but that will allow me to constrain just the x coordinate of each node. Then, the tool will automatically choose y coordinates to make the graph look neat.
Basically, I want to make a timeline.
Language / platform / rendering medium are not very important.
If you want a neat-looking graph, a force-directed algorithm is going to be your best bet. One of the best ones is SFDP (developed by AT&T, included in graphviz), though I can't seem to find pseudocode or an easy implementation. I don't think there are any algorithms this specialized. Thankfully, it's easy to code your own. I'll present some pseudocode mostly lifted from Wikipedia, but with suitably one-dimensional modifications. I'll assume you have n vertices and the vector of x-positions is x, subscripted by x.i.
set all vertex velocities to (0,0)
set all vertex positions to (x.i, random)
repeat
    KE = 0
    for each vertex v
        force = (0,0)
        for each vertex u != v
            force = force + (0, coulomb(u, v).y)
            if u is incident to v
                force = force + (0, hooke(u, v).y)
        v.velocity = (v.velocity + timestep * force) * damping
        v.position = v.position + timestep * v.velocity
        KE = KE + |v.velocity| ^ 2
until KE < epsilon
Here the .y denotes getting the y-component of the force. This ensures that the x-components of the positions of the vertices never change from what you set them to be. The epsilon parameter is to be set by you, and should be something small compared to what you expect KE (the kinetic energy) to be. Also, |v| denotes the magnitude of the vector v (all computations are of 2-vectors in the above, except the KE). Note I set the mass of all the nodes to be 1, but you can change that if you want.
The Hooke and Coulomb functions calculate the respective forces on v due to u; the first grows linearly with the distance between vertices, while the second falls off with the square of the distance, so there is a guaranteed equilibrium. These functions look something like
def hooke(u, v)
    return -k * (v.position - u.position)
def coulomb(u, v)
    d = v.position - u.position
    return C * d / |d|^3
where again most computations are in vector form. C and k have real values but experiment to get the graph you want. This isn't usually necessary because the scaling factors will, in two dimensions, pretty much expand or contract the whole graph, but here the x-distances are set so to get a good-looking graph you will have to change the values a bit.
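The pseudocode above could be turned into something like the following C++ sketch. The names Node and layout_y, the default parameter values, the iteration cap and the small softening term that avoids division by zero are all assumptions made for this illustration, and it uses an inverse-square Coulomb repulsion with a linear Hooke attraction along edges.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

struct Node { double x; double y; };  // x is fixed by the caller, y is laid out

// Force-directed layout that moves vertices along y only.
// edges holds index pairs into nodes. Give the nodes distinct (e.g. random)
// starting y values: two nodes at exactly the same position feel no force.
void layout_y(std::vector<Node>& nodes,
              const std::vector<std::pair<std::size_t, std::size_t>>& edges,
              double k = 0.1, double C = 1.0, double timestep = 0.05,
              double damping = 0.8, double epsilon = 1e-9, int max_iters = 10000) {
    const std::size_t n = nodes.size();
    std::vector<double> vy(n, 0.0);     // y-velocities (all masses are 1)
    std::vector<double> force(n, 0.0);  // y-component of the net force

    for (int iter = 0; iter < max_iters; ++iter) {
        std::fill(force.begin(), force.end(), 0.0);

        // Coulomb repulsion between every pair of vertices, y-component only.
        for (std::size_t i = 0; i < n; ++i) {
            for (std::size_t j = 0; j < n; ++j) {
                if (i == j) continue;
                const double dx = nodes[i].x - nodes[j].x;
                const double dy = nodes[i].y - nodes[j].y;
                const double dist2 = dx * dx + dy * dy + 1e-9;  // softened
                force[i] += C * dy / (dist2 * std::sqrt(dist2));
            }
        }

        // Hooke attraction along each edge, y-component only.
        for (const auto& e : edges) {
            assert(e.first < n && e.second < n);
            const double dy = nodes[e.second].y - nodes[e.first].y;
            force[e.first] += k * dy;
            force[e.second] -= k * dy;
        }

        // Integrate and measure the kinetic energy.
        double ke = 0.0;
        for (std::size_t i = 0; i < n; ++i) {
            vy[i] = (vy[i] + timestep * force[i]) * damping;
            nodes[i].y += timestep * vy[i];
            ke += vy[i] * vy[i];
        }
        if (ke < epsilon) break;  // the layout has settled
    }
}
```

Because only y is ever updated, the x coordinates supplied by the caller are preserved exactly, which is the timeline constraint from the question. The iteration cap guarantees termination even when there are no edges to balance the repulsion.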

How do you pack 3 floats (a space vector) into 4 bytes (a pixel)?

I've successfully packed floats with values in [0,1] without losing too much precision using:
byte packedVal = floatVal * 255.0f ; // [0,1] -> [0,255]
Then when I want to unpack the packedVal back into a float, I simply do
float unpacked = packedVal / 255.0f ; // [0,255] -> [0,1]
That works fine, as long as the floats are between 0 and 1.
Now here's the real deal. I'm trying to turn a 3d space vector (with 3 float components) into 4 bytes. The reason I'm doing this is because I am using a texture to store these vectors, with 1 pixel per vector. It should be something like a "normal map", (but not exactly this, you'll see why after the jump)
So there, each pixel represents a 3d space vector. Where the value is very red, the normal vector's direction is mostly +x (to the right).
So of course, normals are normalized, so they don't require a magnitude (scaling) factor. But I'm trying to store a vector with arbitrary magnitude, 1 vector per pixel.
Because textures have 4 components (rgba), I am thinking of storing a scaling factor in the w component.
Any other suggestions for packing an arbitrarily sized 3-space vector (say, with an upper limit on magnitude of 200 or so on each of x, y, z) into a 4-byte pixel color value?
Storing the magnitude in the 4th component sounds very reasonable, as long as the magnitude is bounded to something reasonable and not completely arbitrary.
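A minimal sketch of that layout, assuming the direction is stored in the first three bytes (remapped from [-1, 1] to [0, 255]) and the magnitude in the fourth byte as a fraction of a known upper bound. The names Pixel, pack_vector, unpack_vector and the max_magnitude parameter are hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cmath>
#include <cstdint>

using Pixel = std::array<std::uint8_t, 4>;

// Maps a value in [0, 1] to a byte in [0, 255], clamping out-of-range input.
std::uint8_t to_byte(float unit) {
    const float clamped = std::min(1.0f, std::max(0.0f, unit));
    return static_cast<std::uint8_t>(std::lround(clamped * 255.0f));
}

// Direction goes into bytes 0..2, magnitude / max_magnitude into byte 3.
Pixel pack_vector(float x, float y, float z, float max_magnitude) {
    assert(max_magnitude > 0.0f);
    const float mag = std::sqrt(x * x + y * y + z * z);
    const float inv = mag > 0.0f ? 1.0f / mag : 0.0f;  // the zero vector has no direction
    return { to_byte(x * inv * 0.5f + 0.5f),
             to_byte(y * inv * 0.5f + 0.5f),
             to_byte(z * inv * 0.5f + 0.5f),
             to_byte(mag / max_magnitude) };
}

std::array<float, 3> unpack_vector(const Pixel& p, float max_magnitude) {
    const float mag = p[3] / 255.0f * max_magnitude;
    auto dir = [](std::uint8_t b) { return b / 255.0f * 2.0f - 1.0f; };
    return { dir(p[0]) * mag, dir(p[1]) * mag, dir(p[2]) * mag };
}
```

With components bounded by 200, max_magnitude would be 200 * sqrt(3), about 346. The magnitude is then quantized in steps of roughly 1.4 units and each direction component in steps of about 1/128, so the round trip is lossy by design.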
If you want a more flexible range of magnitudes, you can pre-multiply the normalized direction vector by a factor in (0.5, 1.0] when you store it, and multiply it by pow(2, w) when you unpack it.
Such a method is used for storing high dynamic range images: RGBM encoding (M stands for magnitude). One of its drawbacks is wrong results from interpolation, so you can't use bilinear filtering for your texture.
You can look for other options among HDR encodings: here is a short list of a few of the most popular ones

Is it possible to do an algebraic curve fit with just a single pass of the sample data?

I would like to do an algebraic curve fit of 2D data points, but for various reasons it isn't really possible to have much of the sample data in memory at once, and iterating through all of it is an expensive process.
(The reason for this is that actually I need to fit thousands of curves simultaneously based on gigabytes of data which I'm reading off disk, and which is therefore sloooooow).
Note that the number of polynomial coefficients will be limited (perhaps 5-10), so an exact fit will be extremely unlikely, but this is ok as I'm trying to find an underlying pattern in data with a lot of random noise.
I understand how one can use a genetic algorithm to fit a curve to a dataset, but this requires many passes through the sample data, and thus isn't practical for my application.
Is there a way to fit a curve with a single pass of the data, where the state that must be maintained from sample to sample is minimal?
I should add that the nature of the data is that the points may lie anywhere on the X axis between 0.0 and 1.0, but the Y values will always be either 1.0 or 0.0.
So, in Java, I'm looking for a class with the following interface:
public interface CurveFit {
public void addData(double x, double y);
public List<Double> getBestFit(); // Returns the polynomial coefficients
}
The class that implements this must not need to keep much data in its instance fields, no more than a kilobyte even for millions of data points. This means that you can't just store the data as you get it to do multiple passes through it later.
edit: Some have suggested that finding an optimal curve in a single pass may be impossible, however an optimal fit is not required, just as close as we can get it in a single pass.
The bare bones of an approach might be if we have a way to start with a curve, and then a way to modify it to get it slightly closer to new data points as they come in - effectively a form of gradient descent. It is hoped that with sufficient data (and the data will be plentiful), we get a pretty good curve. Perhaps this inspires someone to a solution.
Yes, it is a projection. For
y = X beta + error
where lowercased terms are vectors, and X is a matrix, you have the solution vector
\hat{beta} = inverse(X'X) X' y
as per the OLS page. You almost never want to compute this directly but rather use LR, QR or SVD decompositions. References are plentiful in the statistics literature.
If your problem has only one parameter (and x is hence a vector as well) then this reduces to just summation of cross-products between y and x.
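For a polynomial fit, X'X and X'y depend only on running sums of powers of x and of x^i * y, so they can be accumulated in a single pass with a handful of doubles. The sketch below illustrates this with the normal equations solved by Gaussian elimination; the class name StreamingPolyFit and its methods are hypothetical, and for higher degrees a more stable solver (QR or SVD, as recommended above) would be preferable.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <utility>
#include <vector>

// Single-pass least-squares polynomial fit. State: 3 * degree + 2 doubles.
class StreamingPolyFit {
public:
    explicit StreamingPolyFit(std::size_t degree)
        : m_(degree + 1), power_sums_(2 * degree + 1, 0.0), moment_sums_(degree + 1, 0.0) {}

    void add_data(double x, double y) {
        double p = 1.0;  // x^i
        for (std::size_t i = 0; i < power_sums_.size(); ++i) {
            power_sums_[i] += p;                   // entries of X'X
            if (i < m_) moment_sums_[i] += p * y;  // entries of X'y
            p *= x;
        }
    }

    // Returns coefficients c[0] + c[1] x + ... + c[degree] x^degree.
    std::vector<double> best_fit() const {
        // Augmented matrix [X'X | X'y], where (X'X)[i][j] = sum of x^(i+j).
        std::vector<std::vector<double>> a(m_, std::vector<double>(m_ + 1));
        for (std::size_t i = 0; i < m_; ++i) {
            for (std::size_t j = 0; j < m_; ++j) a[i][j] = power_sums_[i + j];
            a[i][m_] = moment_sums_[i];
        }
        // Gaussian elimination with partial pivoting.
        for (std::size_t col = 0; col < m_; ++col) {
            std::size_t pivot = col;
            for (std::size_t r = col + 1; r < m_; ++r)
                if (std::fabs(a[r][col]) > std::fabs(a[pivot][col])) pivot = r;
            std::swap(a[col], a[pivot]);
            if (a[col][col] == 0.0)
                throw std::runtime_error("singular system: too few distinct x values");
            for (std::size_t r = col + 1; r < m_; ++r) {
                const double f = a[r][col] / a[col][col];
                for (std::size_t c = col; c <= m_; ++c) a[r][c] -= f * a[col][c];
            }
        }
        // Back substitution.
        std::vector<double> coef(m_, 0.0);
        for (std::size_t i = m_; i-- > 0;) {
            double s = a[i][m_];
            for (std::size_t j = i + 1; j < m_; ++j) s -= a[i][j] * coef[j];
            coef[i] = s / a[i][i];
        }
        return coef;
    }

private:
    std::size_t m_;                    // number of coefficients
    std::vector<double> power_sums_;   // sum of x^k for k = 0 .. 2 * degree
    std::vector<double> moment_sums_;  // sum of x^k * y for k = 0 .. degree
};
```

This mirrors the addData / getBestFit interface from the question: the state stays the same size no matter how many points are added, and best_fit can be called at any time once enough distinct x values have been seen.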
If you don't mind that you'll get a straight line "curve", then you only need six variables for any amount of data. Here's the source code that's going into my upcoming book; I'm sure that you can figure out how the DataPoint class works:
Interpolation.h:
#ifndef INTERPOLATION_H
#define INTERPOLATION_H
#include "DataPoint.h"
class Interpolation
{
private:
    int m_count;
    double m_sumX;
    double m_sumXX; /* sum of X*X */
    double m_sumXY; /* sum of X*Y */
    double m_sumY;
    double m_sumYY; /* sum of Y*Y */
public:
    Interpolation();
    void addData(const DataPoint& dp);
    double slope() const;
    double intercept() const;
    double interpolate(double x) const;
    double correlate() const;
};
#endif // INTERPOLATION_H
Interpolation.cpp:
#include <cmath>
#include "Interpolation.h"
Interpolation::Interpolation()
{
    m_count = 0;
    m_sumX = 0.0;
    m_sumXX = 0.0;
    m_sumXY = 0.0;
    m_sumY = 0.0;
    m_sumYY = 0.0;
}
void Interpolation::addData(const DataPoint& dp)
{
    m_count++;
    m_sumX += dp.getX();
    m_sumXX += dp.getX() * dp.getX();
    m_sumXY += dp.getX() * dp.getY();
    m_sumY += dp.getY();
    m_sumYY += dp.getY() * dp.getY();
}
double Interpolation::slope() const
{
    return (m_sumXY - (m_sumX * m_sumY / m_count)) /
           (m_sumXX - (m_sumX * m_sumX / m_count));
}
double Interpolation::intercept() const
{
    return (m_sumY / m_count) - slope() * (m_sumX / m_count);
}
double Interpolation::interpolate(double X) const
{
    return intercept() + slope() * X;
}
double Interpolation::correlate() const
{
    // Pearson correlation coefficient, using the same centered sums as slope().
    const double covXY = m_sumXY - (m_sumX * m_sumY / m_count);
    const double varX = m_sumXX - (m_sumX * m_sumX / m_count);
    const double varY = m_sumYY - (m_sumY * m_sumY / m_count);
    return covXY / std::sqrt(varX * varY);
}
Why not use a ring buffer of some fixed size (say, the last 1000 points) and do a standard QR decomposition-based least squares fit to the buffered data? Once the buffer fills, each time you get a new point you replace the oldest and re-fit. That way you have a bounded working set that still has some data locality, without all the challenges of live stream (memoryless) processing.
Are you limiting the number of polynomial coefficients (i.e. fitting to a max power of x in your polynomial)?
If not, then you don't need a "best fit" algorithm - you can always fit N data points EXACTLY to a polynomial of N coefficients.
Just use matrices to solve N simultaneous equations for N unknowns (the N coefficients of the polynomial).
If you are limiting to a max number of coefficients, what is your max?
Following your comments and edit:
What you want is a low-pass filter to filter out noise, not fit a polynomial to the noise.
Given the nature of your data:
the points may lie anywhere on the X axis between 0.0 and 1.0, but the Y values will always be either 1.0 or 0.0.
Then you don't need even a single pass, as these two lines will pass exactly through every point:
X = [0.0 ... 1.0], Y = 0.0
X = [0.0 ... 1.0], Y = 1.0
Two short line segments, unit length, and every point falls on one line or the other.
Admittedly, an algorithm to find a good curve fit for arbitrary points in a single pass is interesting, but (based on your question), that's not what you need.
Assuming that you don't know which point should belong to which curve, something like a Hough Transform might provide what you need.
The Hough Transform is a technique that allows you to identify structure within a data set. One use is for computer vision, where it allows easy identification of lines and borders within the field of sight.
Advantages for this situation:
Each point need be considered only once
You don't need to keep a data structure for each candidate line, just one (complex, multi-dimensional) structure
Processing of each line is simple
You can stop at any point and output a set of good matches
You never discard any data, so it's not reliant on any accidental locality of references
You can trade off between accuracy and memory requirements
Isn't limited to exact matches, but will highlight partial matches too.
An approach
To find cubic fits, you'd construct a 4-dimensional Hough space, into which you'd project each of your data-points. Hotspots within Hough space would give you the parameters for the cubic through those points.
You need the solution to an overdetermined linear system. The popular methods are Normal Equations (not usually recommended), QR factorization, and singular value decomposition (SVD). Wikipedia has decent explanations, Trefethen and Bau is very good. Your options:
1. Out-of-core implementation via the normal equations. This requires the product A'A where A has many more rows than columns (so the result is very small). The matrix A is completely defined by the sample locations so you don't have to store it, thus computing A'A is reasonably cheap (very cheap if you don't need to hit memory for the node locations). Once A'A is computed, you get the solution in one pass through your input data, but the method can be unstable.
2. Implement an out-of-core QR factorization. Classical Gram-Schmidt will be fastest, but you have to be careful about stability.
3. Do it in-core with distributed memory (if you have the hardware available). Libraries like PLAPACK and SCALAPACK can do this, and the performance should be much better than option 1. The parallel scalability is not fantastic, but will be fine if it's a problem size that you would even think about doing in serial.
4. Use iterative methods to compute an SVD. Depending on the spectral properties of your system (maybe after preconditioning) this could converge very fast and does not require storage for the matrix (which in your case has 5-10 columns, each of which is the size of your input data). A good library for this is SLEPc; you only have to find the product of the Vandermonde matrix with a vector (so you only need to store the sample locations). This is very scalable in parallel.
I believe I found the answer to my own question based on a modified version of this code. For those interested, my Java code is here.

Resources