Time series anomaly detection - statistics

I would like to get suggestions about a time series problem. The data is about strain gauge on the wing of flight which is measured using different sensors. Basically, we are creating the anomalies by simulating the physics model. We have a baseline which is working fine and then created some anomalies by changing some of the factors and recorded over time. Our aim is to create a model which can find out the anomaly during the live testing(it can be a crack on the wing), basically a real time anomaly detection using statistical methods or machine learning.

A few thoughts - sorted roughly from top-to-bottom based on time investiment (assuming little/no prior ML knowledge):
start simple and validate: for what you've described this could be as simple as
create a training / validation dataset using your simulator - since you can simulate, do so for significant episodes of both "standard" and extreme forces applied to the wing
choose a real time smoother: e.g., exponential averaging or moving average, determine a proper parameter for each of your input sensor signals. smooth the input signals.
determine threshold values:
- create rough but sensible lower bound threshold values "by eye"
- use simple statistics to determine a decent threshold value (e.g., using a moving fixed length window of appropriate size, and setting the threshold at a multiple of the standard deviation in that window slid across the entire signal)
in either case, testing on further simulated (and - ideally also - real data)
If an effort like this works "good enough" - stop and move on to next (facet of) problem. If not
follow the first two steps (simulate and smooth data)
take an "autoregressive" approach create training / validation input/output pairs by running a sliding window of fixed length over the input signal(s). train a simple supervised learner on thes pairs, for each input signal or all together, to produce a (set of) time series anamoly detectors trained on your simulated data. cross-validate with the validation portion of your data.
use this model (or one like it) on your validation data to test performance - and ideall collect real data (not simulated) to validate your model even further on.
If this sort of approach produces "good enough" results - stop, and move onto the next facet of the problem.
If not - examine and try any number of anomoly detection approaches coded in a variety languages listed on an aggregator like the awesome repo for time series anomaly detection

Related

Maximum log-likelihood from data histogram not data directly

I have a complicated theoretical Probability Density Function (PDF) that I define in mathematica and that depends on some parameters that I need to estimate from comparison with real data. From a big simulation done on a cluster and not my laptop I have acquired a lot of events (over 10^9).
The way I understand things, given that I know what the PDF is I 'just' need to sum the probability that those events appear for a given set of parameters and maximise this quantity by adjusting the parameters.
However, given the number of events I would rather work with something less computer-time consuming and work for example with something easily generated like an histogram of my data. But then how would my log-likelihood estimator work?
Thanks a lot for your answers!

Ideas on filtering out consistent time series data

So I have two subsets of data that represent two situations. The one that look more consistent needs to be filtered out (they are noise) while the one looks random are kept (they are motions). The method I was using was to define a moving window = 10 and whenever the standard deviation of the data within the window was smaller than some threshold, I suppressed them. However, this method could not filter out all "consistent" noise while also hurting the inconsistent one (real motion). I was hoping to use some kinds of statistical models and not machine learning to accomplish this. Any suggestions would be appreciated!
noise
real motion
The Kolmogorov–Smirnov test is used to compare two samples to determine if they come from the same distribution. I realized that real world data would never be uniform. So instead of comparing my noise data against the uniform distribution, I used scipy.stats.ks_2samp function to compare any bursts against one real motion burst. I then muted the motion if the return p-value is significantly small, meaning I can reject the hypothesis that two samples are from the same distribution.

Regularize unevenly spaced time series with spark-ts

We plan to store our sensor time series data in cassandra and use spark/spark-ts to apply machine learning algorithms on it.
Unlike in the documentation, our time series data is irregular - unevenly spaced time series - as the sensors send the data event-based.
But most algorithms and models require regular time series.
Does spark-ts provide any function to transform the irregular time series to regular ones (using interpolation or time-weighted-average, etc.)?
If not, what would be a recommended approach to solve that problem ?
spark-ts does not provide any function to transform irregular time series to regular ones.
How you handle irregularly-spaced time series depends on the goals you are trying to achieve through your analysis. Use cases for time series include prediction/forecasting, anomaly detection, or trying to understand/analyze past behaviour.
If you wish to use the algorithms available in spark-ts (as opposed to modeling your data through other statistical processes designed for event streams), one option is to divide the time axis into equally-sized bins, and then compute a summary of your data within each bin (e.g., the total, the mean, etc.). As you make your bins more and more fine-grained, the information lost due to quantizing the time dimension is minimized, but your data may be harder to model (so the bin size controls the tradeoff). And so, the binned data then forms an evenly-spaced time series, which you can analyze using typical time series techniques.

Distance dependent Chinese Restaurant Process maybe

I'm new to machine learning and want to implement the distance dependent Chinese Restaurant process in MATLAB for the clustering of audio tracks.
I'm looking to use the dd-CRP on 26 features. I'm guessing the process might go like this
Read in 1st feature vector and assign it a "table"
Read in 2nd feature vector and compare it to the 1st "table", maybe using the cosine angle(due to high dimension) of the two vectors and if it agrees within some defined theta, join that table, else start a new one.
Read in next feature and repeat step 2 for the new feature vector for each existing table.
While this is occurring, I will be keeping track of how many tables there are.
I will be running the algorithm over say for example 16 audio tracks. The way the audio will be fed into the algorithm is the first feature vector will be from say the first frame from audio track 1, the second feature vector from form the first frame in track 2 etc. as I'm trying to find out which audio tracks like to cluster together most, but I don't want to define how many centroids there are. Obviously I'll have to keep track of which audio track is at which "table".
Does this make sense?
This is not a Chinese Restaurant Process. This is a heuristic algorithm which has some similarity to a Chinese Restaurant Process. In a CRP everything is phrased in terms of priors over the assignments of items to clusters (the tables analogy), and these are combined with a likelihood function for each cluster (which formalises the similarity function you described). Inference is then done by Gibbs Sampling, which means non-deterministically sampling which cluster each track is assigned to in turn given all the other assignments. Variational methods for non-parametrics are still in a very preliminary state.
Why do you want to use a CRP? Do you think you'll get something out of it beyond more conventional clustering methods? The bar to entry for the implementation and proper understanding of non-parametrics is pretty high, and they're often of little practical use at the moment because of the constraints on inference I mentioned.
You can use the X-means algorithm, which automatically determines the optimal number of centroids (and hence number of clusters) based on the Bayesian Information Criterion (or BIC). In short, the algorithm looks for how dense each cluster is, and how far is each cluster from the other.

Signal processing: FFT overlap processing resources

Are there any good (if possible scientific) resources available (web or books) about overlap processing. I am not that interested in the effects of using overlap processing and windows when analyzing a signal, since the requirements are different. It is more about the following Real Time situation: (I am currently dealing with audio signals)
Dividing a signal into smaller parts.
Creating overlap windows.
FFTing the windowed chunks.
Do processing in the frequency domain.
IFFT the results.
put the chunks together to a continuous stream.
I am especially interested in the influence of the window used on the resulting error as well as the effect of the overlap length. However I couldn't find any good resources that deal with the subject in detail. Any suggestions?
Edit:
After some discussions if using a window function is appropriate, I found a decent handout explaining the overlap and add/save method. http://www.ece.tamu.edu/~deepa/ecen448/handouts/08c/10_Overlap_Save_Add_handouts.pdf
However, after doing some tests, I noticed that the windowed version would perform more accurate in most cases than the overlap & add/save method. Could anybody confirm this?
I don't want to jump to any conclusions regarding computation time though....
Edit2:
Here are some graphs from my tests:
I created a signal, which consists of three cosine waves
I used this filter function in the time domain for filtering. (It's symmetric, as it is applied to the whole output of the FFT, which also is symmetric for real input signals)
The output of the IFFT looks like this: It can be seen that low frequencies are attenuated more than frequency in the mid range.
For the overlap add/save and the windowed processing I divided the input signal into 8 chunks of 256 samples. After reassembling them they look like that. (sample 490 - 540)
It can be seen that the overlap add/save processes differ from the windowed version at the point where chunks are put together (sample 511). This is the error which leads to different results when comparing windowed process and overlap add/save. The windowed process is closer to the one processed in one big junk.
However, I have no idea why they are there or if they shouldn't be there at all.
This is fairly well-known area of signal processing, and generally speaking if you are doing processing along the lines of FFT -> spectral processing -> IFFT you need to use the "overlap and add" approach. Cross-correlation of two inputs is a classic example, done much more easily in the spectral domain than the time domain.
Here's a short paper I found right away via Google (I just searched for "fft overlap and add"): http://www.coe.montana.edu/ee/rmaher/ee477/ee477_fftlab_sp07.pdf
I would recommend you invest in a good Signal Processing book, such as the classic Rabiner & Gold "Theory and application of digital signal processing" (Prentice-Hall ISBN 0-13-914101-4). That should cover the concept of overlap-and-add processing.
When using an FFT for overlap-add or overlap-save fast convolution filtering, normally you don't want to use a windowing function. The circular windowing artifacts cancel out when combining successive FFT frames in canonical overlap add/save filtering.
ADDED:
If you do use a non-rectangular window, you might want to make sure that all the overlapped frames of windows sum to DC, otherwise your resulting filtered signal will have amplitude scalloping. Rectangular windows and raised-cosine (von Hann) windows will sum to DC if the overlap amount is an exact submultiple of the window width (except, of course, at the very start and end of the overlap sequence).
I have been playing with this attempting to answer the question for myself as to why one would use a window. My only references to a synthesis window are this:
https://ccrma.stanford.edu/~jos/sasp/Inverse_FFT_Synthesis.html
http://recherche.ircam.fr/anasyn/roebel/amt_audiosignale/VL2.pdf
http://www.dspdimension.com/tutorials/
Stephan Bernsee has some good overview information. His smbpitchshift code uses a synthesis window -- He uses the raised cosine on the input block, then applies it again on the output block, but this I believe is necessary because the pitch shifting algorithm is not a linear filtering operation, so it is certain there may be discontinuous artifacts on the window boundaries, thus a synthesis window is used to create a smooth transition between frames.
I think the reason there is not much information specifically addressing windowing for frequency domain real-time convolution is because it doesn't have a practical application unless you also need to do some analysis (ie, and adaptive filter of some sort), then the topics related to spectral spreading is again of interest.
I have plotted outputs from a filtered signal using both a raised cosine window as well as overlap-add method, and the end result is an identical IR, and identical signals. It comes as no surprise since the same operations performed in the time domain yield the same results.
On the other hand, if I implement a broken filter kernel, a smooth windowing function can help mask artifacts. This in a sense windows the broken IR so there is a more cohesive transition between frames. It would still be better to have an IR that is limited to length nfft/2 in the time domain. If you need to obtain a filter response with an IR longer than nfft/2, then you should consider either using a larger FFT size (if latency is not a problem) or use a partitioned convolution scheme:
http://www.google.com/url?sa=t&rct=j&q=&esrc=s&source=web&cd=1&ved=0CB4QFjAA&url=http%3A%2F%2Fpcfarina.eng.unipr.it%2FPublic%2FPapers%2F164-Mohonk2001.PDF&ei=qtH0TorDEoKziQKAloHEDg&usg=AFQjCNGDmz79DiuG1kmPXifbWJ7M-gr9rQ&sig2=CMopEcGc1VArZ3gipWTr_w
or
http://www.music.miami.edu/programs/mue/Research/jvandekieft/jvchapter2.htm
I hope that is helpful to somebody reading this
I hope those links help, even though it doesn't directly address windowing as used in real-time Frequency domain filtering.

Resources