Is it possible to update a value in real time with Spark Streaming? - statistics

Let's assume I have a stream of Double values and I want to compute the average every ten seconds. How can I have a sliding window that doesn't recompute the average from scratch but instead updates it by, let's say, removing the contribution of the oldest ten seconds and adding only the values from the newest ten seconds?

TL;DR: use reduceByWindow with both of its function arguments (jump to the last paragraph for the code snippet).
There are two interpretations of your question: the specific one (how do I get a running mean for one hour, updated every 2 seconds), and the general one (how do I get a computation that updates state in a sparse way). Here's the answer for the general one.
First, notice there is a way to represent your data such that your average-with-updates is easy to compute, based on a windowed DStream: this represents your data as an incremental construction of the stream, with maximal sharing. But it is less efficient, computationally, to recompute the mean on each batch – as you noted.
If you do want to do an update of a complex stateful computation which is invertible, but don't want to touch the stream's construction, there is updateStateByKey – but there Spark doesn't help you in reflecting the incremental aspect of your computation in the stream, you have to manage it yourself.
Here, you do have something simple and invertible, and you don't have a notion of keys. You can use reduceByWindow with its inverse reduction argument, using the usual functions that would let you compute an incremental mean.
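import org.apache.spark.streaming.Durations
import org.apache.spark.streaming.dstream.DStream

val myInitialDStream: DStream[Float] = ??? // your input stream goes here
// Pair every value with a count of 1 so that sums and counts are reduced together.
val myDStreamWithCount: DStream[(Float, Long)] =
  myInitialDStream.map((x) => (x, 1L))
// Add the (sum, count) of a batch that enters the window.
def addOneBatchToMean(previousMean: (Float, Long), newBatch: (Float, Long)): (Float, Long) =
  (previousMean._1 + newBatch._1, previousMean._2 + newBatch._2)
// Subtract the (sum, count) of a batch that leaves the window.
def removeOneBatchToMean(previousMean: (Float, Long), oldBatch: (Float, Long)): (Float, Long) =
  (previousMean._1 - oldBatch._1, previousMean._2 - oldBatch._2)
// Note: the inverse-reduce form needs checkpointing enabled on the StreamingContext.
val runningMeans = myDStreamWithCount.reduceByWindow(addOneBatchToMean, removeOneBatchToMean, Durations.seconds(3600), Durations.seconds(2))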
You get a stream of one-element RDDs, each of which contains a pair (m, n) where m is your running sum over the 1h-window and n the number of elements in the 1h-window. Just return (or map to) m/n to get the mean.
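To make the add/remove idea concrete outside Spark, here is a minimal plain-Python sketch of the same bookkeeping. It is only an illustration: the names add_batch, remove_batch and sliding_means are made up for this example, and the window is counted in batches rather than in seconds.

```python
from collections import deque


def add_batch(state, batch):
    """Fold the (sum, count) of a new batch into the window state."""
    return (state[0] + batch[0], state[1] + batch[1])


def remove_batch(state, batch):
    """Inverse of add_batch: drop a batch that slid out of the window."""
    return (state[0] - batch[0], state[1] - batch[1])


def sliding_means(batches, window_size):
    """Yield the mean over the last `window_size` batches after each new batch."""
    window = deque()   # (sum, count) of every batch currently in the window
    state = (0.0, 0)   # running (sum, count) over the whole window
    for values in batches:
        batch = (float(sum(values)), len(values))
        window.append(batch)
        state = add_batch(state, batch)
        if len(window) > window_size:
            # Only the oldest batch is touched; nothing else is recomputed.
            state = remove_batch(state, window.popleft())
        yield state[0] / state[1] if state[1] else float("nan")


means = list(sliding_means([[1.0, 3.0], [5.0], [7.0, 9.0]], window_size=2))
print(means)
```

Each step costs one addition and at most one subtraction of a (sum, count) pair, regardless of how many values the window holds.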

Related

Limiting an RDD size

I have an RDD as follows:
rdd
.filter { case (_, record) => predicates.forall(_.accept(record)) }
.toDS()
.cache()
It basically filters down an RDD after applying a predicate.
The issue I have is this... Some of my data set RDDs are massive and predicates may be empty meaning that we attempt to cache an entire data set.
Instead what I'd like to do is always limit the size of the data set before I cache it.
I've tried placing a limit as follows:
dataSet
.filter { case (_, record) => predicates.forall(_.accept(record)) }
.limit(10000)
.toDS()
.cache()
but I get OOM errors. It looks to me like the partitions are being overloaded before the limit is applied.
Therefore I'm wondering if there is some way for the limit to be applied to the partitions. So effectively filtering would be paused once we reach the limit.
Scaling out further isn't an option as these data sets are too big.
You should probably look into sampling the RDD. If you provide a consistent seed you will get a consistent result. You most likely don't want withReplacement. This will run faster than using limit. sample still passes over the entire data set, but it filters as it goes, reducing the data set.
RDD.sample(withReplacement, fraction, seed=None)
Parameters:
withReplacement - bool: can elements be sampled multiple times (replaced when sampled out)
fraction - float: expected size of the sample as a fraction of this RDD's size.
  Without replacement: probability that each element is chosen; fraction must be in [0, 1].
  With replacement: expected number of times each element is chosen; fraction must be >= 0.
seed - int, optional: seed for the random number generation
Relevant code links (rdd.sample), (subclass that does the actual work).
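As a rough picture of what sampling without replacement does, here is a plain-Python sketch that keeps each record independently with probability fraction. This is an illustration only, not Spark's implementation; bernoulli_sample is a hypothetical helper name.

```python
import random


def bernoulli_sample(records, fraction, seed=None):
    """Keep each record independently with probability `fraction`."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    rng = random.Random(seed)  # a fixed seed gives a reproducible sample
    for record in records:
        # Records are filtered as they stream past; nothing is buffered.
        if rng.random() < fraction:
            yield record


first = list(bernoulli_sample(range(100_000), 0.01, seed=42))
second = list(bernoulli_sample(range(100_000), 0.01, seed=42))
print(len(first), first == second)
```

Note that the resulting size is only approximately fraction times the input size, so this gives an expected size rather than a hard cap like limit(10000).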

Internals of reduce function in spark-shell

The input file contains 20 lines. I am trying to count the total number of records using the reduce function. Can anyone please explain to me why there is a difference in the results? Because here value of y is nothing but only 1.
Default number of partitions : 4
scala> val rdd = sc.textFile("D:\\LearningPythonTomaszDenny\\Codebase\\wholeTextFiles\\names1.txt")
scala> rdd.map(x=>1).reduce((acc,y) => acc+1)
res17: Int = 8
scala> rdd.map(x=>1).reduce((acc,y) => acc+y)
res18: Int = 20
Because here value of y is nothing but only 1.
That is simply not true. reduce consists of three stages (not in the strict Spark meaning of the word):
1. Distributed reduce on each partition.
2. Collection of the partial results to the driver (synchronous or asynchronous depending on the backend).
3. Local driver reduction.
In your case the results of the first and second stages are the same for both approaches, but in the third stage the first approach simply ignores the partial results. In other words, no matter what the result for a partition was, it always adds only 1.
Such an approach would work only with a non-parallel, sequential reduce implementation.
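The three stages can be mimicked in plain Python to see where the 8 comes from. This is a sketch under one assumption that the question does not state: the 20 lines are split evenly into 4 partitions of 5 lines each. distributed_reduce is a made-up helper, not a Spark API.

```python
from functools import reduce


def distributed_reduce(partitions, f):
    """Reduce every partition on its own, then reduce the partial results."""
    partials = [reduce(f, part) for part in partitions if part]  # stages 1 and 2
    return reduce(f, partials)                                   # stage 3 (driver)


# Assumption: 20 records, already mapped to 1, spread evenly over 4 partitions.
partitions = [[1] * 5 for _ in range(4)]

wrong = distributed_reduce(partitions, lambda acc, y: acc + 1)
right = distributed_reduce(partitions, lambda acc, y: acc + y)
print(wrong, right)
```

With acc + 1, every partition yields 5, but on the driver the partial results 5, 5, 5 are each counted as 1, which gives 5 + 1 + 1 + 1 = 8. With acc + y the partial results are added properly and the total is 20.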

How to find the nearest neighbors of 1 Billion records with Spark?

Given 1 Billion records containing following information:
ID x1 x2 x3 ... x100
1 0.1 0.12 1.3 ... -2.00
2 -1 1.2 2 ... 3
...
For each ID above, I want to find the top 10 closest IDs, based on Euclidean distance of their vectors (x1, x2, ..., x100).
What's the best way to compute this?
As it happens, I have a solution to this, involving combining sklearn with Spark: https://adventuresindatascience.wordpress.com/2016/04/02/integrating-spark-with-scikit-learn-visualizing-eigenvectors-and-fun/
The gist of it is:
Use sklearn’s k-NN fit() method centrally
But then use sklearn’s k-NN kneighbors() method distributedly
Performing a brute-force comparison of all records against all records is a losing battle. My suggestion would be to go for a ready-made implementation of k-Nearest Neighbor algorithm such as the one provided by scikit-learn then broadcast the resulting arrays of indices and distances and go further.
Steps in this case would be:
1- vectorize the features as Bryce suggested and let your vectorizing method return a list (or numpy array) of floats with as many elements as your features
2- fit your scikit-learn nn to your data:
from sklearn.neighbors import NearestNeighbors
nbrs = NearestNeighbors(n_neighbors=10, algorithm='auto').fit(vectorized_data)
3- run the trained algorithm on your vectorized data (training and query data are the same in your case)
distances, indices = nbrs.kneighbors(vectorized_data)
Steps 2 and 3 will run on your pyspark node and are not parallelizable in this case. You will need to have enough memory on this node. In my case with 1.5 Million records and 4 features, it took a second or two.
Until we get a good implementation of NN for Spark, I guess we will have to stick to these workarounds. If you'd rather try something new, then go for http://spark-packages.org/package/saurfang/spark-knn
You haven't provided a lot of detail, but the general approach I would take to this problem would be to:
Convert the records to a data structure like a LabeledPoint with (ID, x1..x100) as label and features
Map over each record and compare that record to all the other records (lots of room for optimization here)
Create some cutoff logic so that once you start comparing ID = 5 with ID = 1 you interrupt the computation because you have already compared ID = 1 with ID = 5
Some reduce step to get a data structure like {id_pair: [1,5], distance: 123}
Another map step to find the 10 closest neighbors of each record
You've identified pyspark and I generally do this type of work using scala, but some pseudo code for each step might look like:
import math

# 1. vectorize the features
def vectorize_raw_data(record):
    # record = [id, x1, ..., x100]; returns (id, list of float features)
    return record[0], [float(v) for v in record[1:]]

# 2, 3 + 4. map over each record for comparison
def calc_distance(record, comparison):
    # euclidean distance: subtract feature by feature, square the result,
    # keep a running sum of those squares and take the square root of the sum
    (id_a, a), (id_b, b) = record, comparison
    distance = math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))
    return {"id_pair": (id_a, id_b), "distance": distance}

# all_records is assumed to be a list of vectorize_raw_data(...) results.
# broadcast_var stands in for the shared structure of already-computed pairs.
broadcast_var = []
for i, record in enumerate(all_records):
    for comparison in all_records[i + 1:]:  # skip pairs that were already compared
        broadcast_var.append(calc_distance(record, comparison))

# 5. map for 10 closest neighbors
def closest_neighbors(record_id, n=10):
    pairs = [d for d in broadcast_var if record_id in d["id_pair"]]
    return sorted(pairs, key=lambda d: d["distance"])[:n]
The pseudocode is terrible, but I think it communicates the intent. There will be a lot of shuffling and sorting here as you are comparing all records with all other records. IMHO, you want to store the keypair/distance in a central place (like a broadcast variable that gets updated, though this is dangerous) to reduce the total number of euclidean distance calculations you perform.

how to differentiate two very long strings in c++?

I would like to solve the Levenshtein distance problem for strings whose length is very large.
Edit 2:
As Bobah said, the title was misleading, so I have updated the title of the question.
The initial title was: how to declare a 100000x100000 2-d integer array in c++?
The content was:
Is there any way to declare int x[100000][100000] in C++?
When I declare it globally, the compiler produces error: size of array ‘x’ is too large.
One method could be to use map< pair< int , int > , int > mymap.
But allocating and deallocating takes more time. Is there any other way, like using vector<int> myvec?
For memory blocks that large, the best approach is dynamic allocation using the operating system's facilities for adding virtual memory to the process.
However, look how large a block you are trying to allocate:
40 000 000 000 bytes
I take my previous advice back. For a block that large, the best approach is to analyze the problem and figure out a way to use less memory.
Filling the edit distance matrix can be done each row at a time. Remembering the previous row is enough to compute the current row. This observation reduces space usage from quadratic to linear. Makes sense?
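The row-at-a-time idea can be sketched as follows. The example is in Python for brevity (the question is about C++, where two std::vector<int> rows would play the same role), and levenshtein is just an illustrative name.

```python
def levenshtein(a, b):
    """Edit distance that keeps only two rows of the matrix in memory."""
    if len(b) > len(a):
        a, b = b, a  # keep the shorter string along the row to save memory
    previous = list(range(len(b) + 1))  # row for the empty prefix of a
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current  # the older row is no longer needed
    return previous[-1]


print(levenshtein("kitten", "sitting"))
```

Memory use is proportional to the shorter string, although the running time is still proportional to the product of the two lengths.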
Your question is very interesting, but the title is misleading.
This is what you need in terms of data model (y - first string, x - second string, * - distance matrix).
          y           <-- first string (scrolls from top down)
          y
x x x x x x x x       <-- second string (scrolls from left to right)
          y * * *
          y * * *
          y * * *     <-- distance matrix (a donut) scrolls together with strings
                          and grows/shrinks when needed, as explained below
          y
Have two relatively long (but still << N) character buffers and relatively small ( << buffers size) rectangular (start from square) distance matrix.
Make the matrix a donut: a two-dimensional ring buffer (you can use the one from boost, or just std::deque).
When the string fragments currently covered by the matrix are a 100% match, shift both buffers by one and rotate the donut around both axes, recalculating one new row/column in the distance matrix.
When the match is below 100% and below the configured threshold, grow both dimensions of the donut without dropping any rows/columns, and keep doing so until either the match rises above the threshold or you reach the maximum donut size. When the match ratio hits the threshold from below, you need to scroll the donut, discarding the heads of the x and y buffers and at the same time aligning them (only X needs moving by 1 when the distance matrix tells you that X[i] does not exist in Y, but X[i+1,i+m] matches Y[j, j+m-1]).
As a result you will have a simple yet very efficient heuristic diff engine with deterministic limited memory footprint and all memory can be pre-allocated at startup so no dynamic allocation will slow it down at runtime.
Apache v2 license, in case you decide to go for it.

Is it possible to do an algebraic curve fit with just a single pass of the sample data?

I would like to do an algebraic curve fit of 2D data points, but for various reasons - it isn't really possible to have much of the sample data in memory at once, and iterating through all of it is an expensive process.
(The reason for this is that actually I need to fit thousands of curves simultaneously based on gigabytes of data which I'm reading off disk, and which is therefore sloooooow).
Note that the number of polynomial coefficients will be limited (perhaps 5-10), so an exact fit will be extremely unlikely, but this is ok as I'm trying to find an underlying pattern in data with a lot of random noise.
I understand how one can use a genetic algorithm to fit a curve to a dataset, but this requires many passes through the sample data, and thus isn't practical for my application.
Is there a way to fit a curve with a single pass of the data, where the state that must be maintained from sample to sample is minimal?
I should add that the nature of the data is that the points may lie anywhere on the X axis between 0.0 and 1.0, but the Y values will always be either 1.0 or 0.0.
So, in Java, I'm looking for a class with the following interface:
public interface CurveFit {
public void addData(double x, double y);
public List<Double> getBestFit(); // Returns the polynomial coefficients
}
The class that implements this must not need to keep much data in its instance fields, no more than a kilobyte even for millions of data points. This means that you can't just store the data as you get it to do multiple passes through it later.
edit: Some have suggested that finding an optimal curve in a single pass may be impossible, however an optimal fit is not required, just as close as we can get it in a single pass.
The bare bones of an approach might be if we have a way to start with a curve, and then a way to modify it to get it slightly closer to new data points as they come in - effectively a form of gradient descent. It is hoped that with sufficient data (and the data will be plentiful), we get a pretty good curve. Perhaps this inspires someone to a solution.
Yes, it is a projection. For
y = X beta + error
where lowercased terms are vectors, and X is a matrix, you have the solution vector
\hat{beta} = inverse(X'X) X' y
as per the OLS page. You almost never want to compute this directly but rather use LR, QR or SVD decompositions. References are plentiful in the statistics literature.
If your problem has only one parameter (and x is hence a vector as well) then this reduces to just summation of cross-products between y and x.
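For a polynomial with a handful of coefficients, the sums behind X'X and X'y can be accumulated one sample at a time, which matches the single-pass interface the question asks for. The sketch below is one possible reading of that idea, not code from any answer: the class name StreamingPolyFit is made up, and it solves the normal equations directly with Gauss-Jordan elimination, which is exactly the shortcut the paragraph above warns can be numerically fragile.

```python
class StreamingPolyFit:
    """Single-pass least-squares polynomial fit via the normal equations."""

    def __init__(self, degree):
        self.n = degree + 1
        self.xtx = [[0.0] * self.n for _ in range(self.n)]  # running X'X
        self.xty = [0.0] * self.n                           # running X'y

    def add_data(self, x, y):
        powers = [x ** k for k in range(self.n)]  # one row of X: 1, x, x^2, ...
        for i in range(self.n):
            self.xty[i] += powers[i] * y
            for j in range(self.n):
                self.xtx[i][j] += powers[i] * powers[j]

    def best_fit(self):
        """Solve (X'X) beta = X'y; returns coefficients, lowest power first."""
        n = self.n
        m = [row[:] + [rhs] for row, rhs in zip(self.xtx, self.xty)]
        for col in range(n):
            # Partial pivoting: use the row with the largest entry in this column.
            pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
            if abs(m[pivot][col]) < 1e-12:
                raise ValueError("not enough distinct x values for this degree")
            m[col], m[pivot] = m[pivot], m[col]
            for r in range(n):
                if r != col:
                    factor = m[r][col] / m[col][col]
                    for c in range(col, n + 1):
                        m[r][c] -= factor * m[col][c]
        return [m[i][n] / m[i][i] for i in range(n)]


fit = StreamingPolyFit(degree=2)
for x in [0.0, 0.25, 0.5, 0.75, 1.0]:
    fit.add_data(x, 1.0 + 2.0 * x + 3.0 * x * x)
coeffs = fit.best_fit()
print(coeffs)
```

The state is (degree + 1)^2 + (degree + 1) numbers no matter how many points are added, so it stays far below a kilobyte for 5 to 10 coefficients.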
If you don't mind that you'll get a straight line "curve", then you only need six variables for any amount of data. Here's the source code that's going into my upcoming book; I'm sure that you can figure out how the DataPoint class works:
Interpolation.h:
#ifndef INTERPOLATION_H  /* names starting with a double underscore are reserved */
#define INTERPOLATION_H
#include "DataPoint.h"
class Interpolation
{
private:
    int m_count;
    double m_sumX;
    double m_sumXX; /* sum of X*X */
    double m_sumXY; /* sum of X*Y */
    double m_sumY;
    double m_sumYY; /* sum of Y*Y */
public:
    Interpolation();
    void addData(const DataPoint& dp);
    double slope() const;
    double intercept() const;
    double interpolate(double x) const;
    double correlate() const;
};
#endif // INTERPOLATION_H
Interpolation.cpp:
#include <cmath>
#include "Interpolation.h"
Interpolation::Interpolation()
{
    m_count = 0;
    m_sumX = 0.0;
    m_sumXX = 0.0;
    m_sumXY = 0.0;
    m_sumY = 0.0;
    m_sumYY = 0.0;
}
void Interpolation::addData(const DataPoint& dp)
{
    /* only the running sums are updated; the point itself is not stored */
    m_count++;
    m_sumX += dp.getX();
    m_sumXX += dp.getX() * dp.getX();
    m_sumXY += dp.getX() * dp.getY();
    m_sumY += dp.getY();
    m_sumYY += dp.getY() * dp.getY();
}
double Interpolation::slope() const
{
    return (m_sumXY - (m_sumX * m_sumY / m_count)) /
           (m_sumXX - (m_sumX * m_sumX / m_count));
}
double Interpolation::intercept() const
{
    return (m_sumY / m_count) - slope() * (m_sumX / m_count);
}
double Interpolation::interpolate(double X) const
{
    return intercept() + slope() * X;
}
double Interpolation::correlate() const
{
    /* Pearson correlation, built from the same centered sums as slope() */
    const double covXY = m_sumXY - (m_sumX * m_sumY / m_count);
    const double varX = m_sumXX - (m_sumX * m_sumX / m_count);
    const double varY = m_sumYY - (m_sumY * m_sumY / m_count);
    return covXY / std::sqrt(varX * varY);
}
Why not use a ring buffer of some fixed size (say, the last 1000 points) and do a standard QR decomposition-based least squares fit to the buffered data? Once the buffer fills, each time you get a new point you replace the oldest and re-fit. That way you have a bounded working set that still has some data locality, without all the challenges of live stream (memoryless) processing.
Are you limiting the number of polynomial coefficients (i.e. fitting to a max power of x in your polynomial)?
If not, then you don't need a "best fit" algorithm - you can always fit N data points EXACTLY to a polynomial of N coefficients.
Just use matrices to solve N simultaneous equations for N unknowns (the N coefficients of the polynomial).
If you are limiting to a max number of coefficients, what is your max?
Following your comments and edit:
What you want is a low-pass filter to filter out noise, not fit a polynomial to the noise.
Given the nature of your data:
the points may lie anywhere on the X axis between 0.0 and 1.0, but the Y values will always be either 1.0 or 0.0.
Then you don't need even a single pass, as these two lines will pass exactly through every point:
X = [0.0 ... 1.0], Y = 0.0
X = [0.0 ... 1.0], Y = 1.0
Two short line segments, unit length, and every point falls on one line or the other.
Admittedly, an algorithm to find a good curve fit for arbitrary points in a single pass is interesting, but (based on your question), that's not what you need.
Assuming that you don't know which point should belong to which curve, something like a Hough Transform might provide what you need.
The Hough Transform is a technique that allows you to identify structure within a data set. One use is for computer vision, where it allows easy identification of lines and borders within the field of sight.
Advantages for this situation:
Each point need be considered only once
You don't need to keep a data structure for each candidate line, just one (complex, multi-dimensional) structure
Processing of each line is simple
You can stop at any point and output a set of good matches
You never discard any data, so it's not reliant on any accidental locality of references
You can trade off between accuracy and memory requirements
Isn't limited to exact matches, but will highlight partial matches too.
An approach
To find cubic fits, you'd construct a 4-dimensional Hough space, into which you'd project each of your data-points. Hotspots within Hough space would give you the parameters for the cubic through those points.
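The voting mechanism is easier to see in two dimensions. The sketch below finds a straight line y = a*x + b instead of a cubic, so the Hough space has only two parameters; the cubic case described above would use four. The function name hough_lines, the integer slope grid and the sample points are all assumptions made for this illustration.

```python
from collections import Counter


def hough_lines(points, slopes, intercept_step=1.0):
    """Vote in a discretised (slope, intercept) space for lines y = a*x + b."""
    votes = Counter()
    for x, y in points:              # each point is considered exactly once
        for a in slopes:
            b = y - a * x            # the intercept this point implies for slope a
            b_bin = round(b / intercept_step) * intercept_step
            votes[(a, b_bin)] += 1
    return votes


# Ten points on y = 2x + 1 plus two outliers.
points = [(x, 2 * x + 1) for x in range(10)] + [(3, 40), (7, -15)]
votes = hough_lines(points, slopes=range(-5, 6))
(best_a, best_b), count = votes.most_common(1)[0]
print(best_a, best_b, count)
```

The hotspot in the vote table is the cell for slope 2 and intercept 1, and the outliers barely affect it. A coarser intercept_step or slope grid uses less memory at the cost of accuracy, which is the trade-off mentioned in the list above.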
You need the solution to an overdetermined linear system. The popular methods are Normal Equations (not usually recommended), QR factorization, and singular value decomposition (SVD). Wikipedia has decent explanations, Trefethen and Bau is very good. Your options:
1. Out-of-core implementation via the normal equations. This requires the product A'A where A has many more rows than columns (so the result is very small). The matrix A is completely defined by the sample locations so you don't have to store it, thus computing A'A is reasonably cheap (very cheap if you don't need to hit memory for the node locations). Once A'A is computed, you get the solution in one pass through your input data, but the method can be unstable.
2. Implement an out-of-core QR factorization. Classical Gram-Schmidt will be fastest, but you have to be careful about stability.
3. Do it in-core with distributed memory (if you have the hardware available). Libraries like PLAPACK and SCALAPACK can do this, and the performance should be much better than option 1. The parallel scalability is not fantastic, but will be fine if it's a problem size that you would even think about doing in serial.
4. Use iterative methods to compute an SVD. Depending on the spectral properties of your system (maybe after preconditioning) this could converge very fast and does not require storage for the matrix (which in your case has 5-10 columns, each of which is the size of your input data). A good library for this is SLEPc; you only have to provide the product of the Vandermonde matrix with a vector (so you only need to store the sample locations). This is very scalable in parallel.
I believe I found the answer to my own question based on a modified version of this code. For those interested, my Java code is here.

Resources