I have a series of plots looking like this:
The raw data looks like:
dataPoint_1, dataPoint_2, dataPoint_3, ...
23, 22, 56, ...
14, 13, 68, ...
In the above diagram, some data points have values close to each other (red, pink, black, ...) and some are far away from the others (green, blue, ...). And the data keeps coming, so the lines keep growing longer. Is there an algorithm that can help me find which data points (lines in the diagram) are close together and which are not? I'm not sure whether some statistical algorithm fits this problem.
Try Euclidean distance. Clearly, the difference between these series is substantial.
You can also try DTW (Dynamic Time Warping), but I'm not sure it adds much here.
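If it helps, here is a minimal sketch in Python (plain NumPy; the toy array holds the first two rows of your sample, and with streaming data you would recompute this over a sliding window):

import numpy as np

def euclidean(a, b):
    # distance between two equal-length series
    return np.linalg.norm(a - b)

def dtw(a, b):
    # plain O(n*m) dynamic-time-warping distance
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# each column of the raw data is one line in the diagram
raw = np.array([[23, 22, 56],
                [14, 13, 68]], dtype=float)
series = {f"dataPoint_{k + 1}": raw[:, k] for k in range(raw.shape[1])}

names = sorted(series)
for i, p in enumerate(names):
    for q in names[i + 1:]:
        print(p, q, euclidean(series[p], series[q]), dtw(series[p], series[q]))

Pairs with a small distance are the lines that track each other; if you want groups rather than pairs, feed the distance matrix to any standard clustering method.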
I have a discrete 1-D data set with a value range of 0-100. The underlying distribution is unknown (although we have enough data to fit a model); in short, it is a highly right-skewed data set, with the vast majority of values between 0 and 5.
My goal is to detect spikes in this data set, such as [0, 0, 0, 20, 15, 5, 0, ...]. One problem is that although spikes peak very visibly, they do not fade out as sharply. For example, in the sequence above, I would like to detect the [20, 15] part as a single spike, although many methods report these as two distinct "outlier points".
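To make the grouping concrete, this is the behavior I am after, sketched in Python with a hypothetical fixed threshold (choosing the threshold robustly is exactly the part I don't know):

data = [0, 0, 0, 20, 15, 5, 0]
threshold = 5   # hypothetical; in reality this should come from the data

spikes, current = [], []
for v in data:
    if v > threshold:
        current.append(v)        # still inside the same spike
    elif current:
        spikes.append(current)   # the spike has ended
        current = []
if current:
    spikes.append(current)

print(spikes)   # [[20, 15]]: one spike, not two separate outliers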
I do not have a strong statistical background; I am a systems engineer.
What are the steps to follow in this scenario?
Thank you for your help,
I'm almost a newbie at PyTorch.
One of my conv outputs has size [1, 25, 8, 32]
(25 = channels, 8 = height, 32 = width).
I can use squeeze to make it [25, 8, 32].
But I'm confused about the 25 channels.
When I want to visualize the sum of the 25 channels as one grayscale or RGB image (1 or 3 x 8 x 32), how can I deal with this in code?
I can use matplotlib or tensorboardX for visualizing.
It is difficult to visualize images with more than 3 channels, and it is unclear what a feature vector in 25-dimensional space actually looks like.
The most straightforward approach would be to visualize the 8x32 feature maps you have as 25 separate grayscale images of size 8x32. Each image will show how "sensitive" a specific neuron/conv filter/channel (these are all equivalent) is to the input at a certain spatial location.
There are more intricate methods for feature visualization, you can find more details about them in this blog post.
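A minimal sketch of both options (assuming your activation tensor is called feat with shape [1, 25, 8, 32]; here it is random so the code runs standalone):

import torch
import matplotlib.pyplot as plt

feat = torch.randn(1, 25, 8, 32)   # stand-in for your conv output
maps = feat.squeeze(0)             # [25, 8, 32]

# option 1: 25 separate grayscale images, one per channel
fig, axes = plt.subplots(5, 5, figsize=(12, 4))
for i, ax in enumerate(axes.flat):
    ax.imshow(maps[i].detach().cpu().numpy(), cmap="gray")
    ax.set_title(f"ch {i}")
    ax.axis("off")

# option 2: collapse the channels into a single 8x32 grayscale image
summed = maps.sum(dim=0)           # or maps.mean(dim=0)
plt.figure()
plt.imshow(summed.detach().cpu().numpy(), cmap="gray")
plt.show()

Note that summing collapses the per-channel information into one map; whether that is acceptable depends on what you want the image to show.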
I have a data set that I receive from an outside source, and have no real control over.
The data, when plotted, shows two clumps of points with several sparse, irrelevant points. Here is a sample plot:
There is a clump of points on the left, clustered around (1, 16). This clump is actually part of a set of points that lies on (or near to) a line stretching from (1, 17.5) to (2.4, 13).
There is also an apparent curve from (1.75, 18) to (2.75, 12.5).
Finally, there are some sparse points above the second curve, around (2.5, 17).
Visually, it's not difficult to separate these groups of points. However, I need to separate these points within the data file into three groups, which I'll call Line, Curve, and Other (the Curve group is the one I actually need). I'd like to write a program that can do this reasonably well without needing to see the plot.
Now, I'm going to add a couple of items that make this much worse. This is only a sample set of data. While the shapes of the curve and line are relatively constant from one data set to the next, the positions are not. These regions can (and do) shift, both horizontally and vertically. The only real constant is that there's a negative-slope line from the top-left to the bottom-right of the plot, a rough curve from the top-center to the bottom-right, and most of the sparse points are in the top-right corner, above the curve.
I'm on Linux, and I'm out of ideas. I can tell you the approaches that I've tried, though they have not done well.
First, I cleaned up the data set and sorted it in ascending order by x-coordinate. I thought that maybe the points were sorted in some sort of a logical way that would allow me to 'head' or 'tail' the data to achieve the desired result, but this was not the case.
I can write code in anything (Python, Fortran, C, etc.) that removes a point if it's not within X distance of the previous point. This would be just fine, except that the scattering of the points is such that two points very near each other in x are separated by an appreciable distance in y. It also doesn't help that the Line and Curve draw near one another for larger x-values.
I can fit a curve to a partial data set. When I sort the data by x-coordinate, for example, I can choose to only plot the first 30 points, or the last 200, or some set of 40 in the middle somewhere. That's not a problem. But the Line points tuck underneath the Curve points, which causes a problem.
If the Line points were fairly constant (which they're not), I could rotate my plot by some angle so that the Line is vertical, look only at the points to the right of that line, then rotate back. This may be the best way to go about doing this, but in order to do it, I need to be able to isolate the linear points, which is more or less the essence of the problem.
The other idea that seems plausible to me is to try to identify point density and split the data into separate files by those parameters. I think this is the best candidate for this problem, since it is independent of point location. However, I'm not sure how to go about doing this, especially because the Line and Curve come quite close together for larger x-values (in the sample plot, x-values greater than about 2).
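The rough shape of what I have in mind, sketched in Python with scikit-learn's DBSCAN (I have not actually tried this; the file name, eps, and min_samples are placeholders, and I suspect it may still merge the Line and Curve where they approach each other):

import numpy as np
from sklearn.cluster import DBSCAN

pts = np.loadtxt("data.txt")   # hypothetical file of (x, y) pairs, one per row

# scale x and y to comparable ranges first, otherwise the distance
# metric is dominated by whichever axis has the larger spread
scaled = (pts - pts.mean(axis=0)) / pts.std(axis=0)

labels = DBSCAN(eps=0.15, min_samples=5).fit_predict(scaled)

# label -1 marks sparse/noise points (my "Other" group); the remaining
# labels are density-connected clumps that should roughly match Line and Curve
for lab in np.unique(labels):
    np.savetxt(f"cluster_{lab}.txt", pts[labels == lab])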
I know this does not exactly fall in line with the request for an MWE, but I don't know how I'd go about providing a more classical MWE. If there's something else I can provide that would help, please ask. Thank you in advance.
I implemented a multi-series line chart like the one given here by M. Bostock and ran into a curious issue that I cannot explain. When I choose linear interpolation and set my scales and axes, everything is correct and the values are well-aligned.
But when I change the interpolation to basis, without any modification of my axes and scales, the values no longer line up between the lines and the axis.
What is happening here? With the monotone setting I can achieve pretty much the same effect as the basis interpolation, but without the syncing problem between the lines and the axis. Still, I would like to understand what is happening.
The basis interpolation is implemented as a B-spline (basis spline), which people like to use as an interpolation function precisely because it smooths out extreme peaks. This is useful when you are modeling something you expect to vary smoothly but only have sharp, infrequently sampled data. A consequence is that the resulting line does not pass through all of the data points, which changes the appearance of extreme values.
In your case, the sharp peaks are the interesting features, the exception to the typically 0 baseline value. When you use a spline interpolation, you are smoothing over these peaks.
Here is a fun demo for playing with the different types of line interpolation:
http://bl.ocks.org/mbostock/4342190
You can drag the data points around so they resemble a sharp peak like yours, and even click to add new points. Then switch to the basis interpolation and watch the peak get averaged out.
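If you prefer a numerical demonstration, here is a small sketch (Python/SciPy rather than d3, purely as an illustration) of a uniform cubic B-spline staying well below a sharp peak in its control points (d3's basis additionally duplicates the end control points, which this sketch skips):

import numpy as np
from scipy.interpolate import BSpline
import matplotlib.pyplot as plt

# control points with one sharp peak, like the chart data
x = np.arange(8.0)
y = np.array([0, 0, 0, 20, 0, 0, 0, 0], dtype=float)

k = 3                                  # cubic, as in d3's "basis"
t = np.arange(len(x) + k + 1) - k      # uniform knot vector
spline_x = BSpline(t, x, k)
spline_y = BSpline(t, y, k)

u = np.linspace(t[k], t[len(x)], 200)  # valid parameter range
plt.plot(x, y, "o--", label="data points")
plt.plot(spline_x(u), spline_y(u), label="uniform cubic B-spline")
plt.legend()
plt.show()

print(y.max(), spline_y(u).max())      # roughly 20 vs 13: the peak is smoothed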
I have some vectors of experimental data that I need to massage, for example:
{
{0, 61237, 131895, 194760, 249935},
{0, 61939, 133775, 197516, 251018},
{0, 60919, 131391, 194112, 231930},
{0, 60735, 131015, 193584, 249607},
{0, 61919, 133631, 197186, 250526},
{0, 61557, 132847, 196143, 258687},
{0, 61643, 133011, 196516, 249891},
{0, 62137, 133947, 197848, 251106}
}
Each vector is the result of one run and consists of five numbers, which are times at which an object passes each of five sensors. Over the measurement interval the object's speed is constant (the sensor-to-sensor intervals are different because the sensor spacings are not uniform). From one run to the next the sensors' spacing remains the same, but the object's speed will vary a bit from one run to the next.
If the sensors were perfect, each vector ought to simply be a scalar multiple of any other vector (in proportion to the ratio of their speeds). But in reality each sensor will have some "jitter" and trigger early or late by some small random amount. I am trying to analyze how good the sensors themselves are, i.e. how much "jitter" is there in the measurements they give me?
So I think I need to do the following: scale each vector, and then shift it a bit (adding or subtracting a fixed amount to each of its five elements). Then the StandardDeviation of each column will describe the amount of "noise" or "jitter" in that sensor. The amount by which each vector is scaled, and the amount by which each vector is shifted, has to be chosen to minimize the standard deviations of the columns.
It seemed to me that Mathematica probably has a good toolkit for getting this done; in fact I thought I might have found the answer with Standardize[], but it seems to be oriented towards processing a list of scalars, not a list of lists like I have (or at least I can't figure out how to apply it to my case here).
So I am looking for some hints toward which library function(s) I might use to solve this problem, or perhaps the hints I need to build up the algorithm myself. Perhaps part of my problem is that I can't figure out where to look: is what I have here a "signal processing" problem, a data manipulation or data mining problem, a minimization problem, or maybe a relatively standard statistical function that I simply haven't heard of before?
(As a bonus I would like to be able to control the weighting function used to optimize this scaling/shifting; e.g., in my data above I suspect that sensor #5 is having problems, so I would like the fit to consider only the SDs of sensors 1-4 when doing the scaling/shifting.)
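(To make the objective concrete, here is the minimization I have in mind, sketched in Python/NumPy because I have not found the Mathematica idiom yet; the weight vector w implements the bonus request of ignoring sensor #5:)

import numpy as np
from scipy.optimize import minimize

runs = np.array([
    [0, 61237, 131895, 194760, 249935],
    [0, 61939, 133775, 197516, 251018],
    [0, 60919, 131391, 194112, 231930],
    [0, 60735, 131015, 193584, 249607],
    [0, 61919, 133631, 197186, 250526],
    [0, 61557, 132847, 196143, 258687],
    [0, 61643, 133011, 196516, 249891],
    [0, 62137, 133947, 197848, 251106],
], dtype=float)

n = len(runs)
w = np.array([1.0, 1.0, 1.0, 1.0, 0.0])   # ignore suspect sensor #5

def objective(p):
    # run 1 is the reference (scale 1, shift 0), which rules out the
    # degenerate solution of scaling every run down to zero
    a = np.concatenate(([1.0], p[:n - 1]))
    b = np.concatenate(([0.0], p[n - 1:]))
    aligned = a[:, None] * runs + b[:, None]
    return np.sum(w * aligned.std(axis=0))

p0 = np.concatenate((np.ones(n - 1), np.zeros(n - 1)))
# the raw times are large, so rescaling them first may help convergence
res = minimize(objective, p0, method="Powell")

a = np.concatenate(([1.0], res.x[:n - 1]))
b = np.concatenate(([0.0], res.x[n - 1:]))
print((a[:, None] * runs + b[:, None]).std(axis=0))   # per-sensor "jitter"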
I can't comment much on your algorithm itself, as data analysis is not my forte. However, from what I understand, you're trying to characterize the timing variations in each sensor. Since the data from a single sensor is in a single column of your matrix, I'd suggest transposing it and mapping Standardize onto each set of data. In other words,
dat = (* your data *)
Standardize /@ Transpose[dat]
To put it back in columnar form, Transpose the result. To exclude your last sensor from this process, simply use Part ([[ ]]) and Span (;;):
Standardize /@ Transpose[dat][[ ;; -2 ]]
Or, using Most:
Standardize /@ Most[Transpose[dat]]
Thinking about it, I think you're going to have a hard time separating out the timing jitter from variation in velocity. Can you intentionally vary the velocity?