Regularize unevenly spaced time series with spark-ts - apache-spark

We plan to store our sensor time series data in cassandra and use spark/spark-ts to apply machine learning algorithms on it.
Unlike in the documentation, our time series data is irregular - unevenly spaced time series - as the sensors send the data event-based.
But most algorithms and models require regular time series.
Does spark-ts provide any function to transform the irregular time series to regular ones (using interpolation or time-weighted-average, etc.)?
If not, what would be a recommended approach to solve that problem?

spark-ts does not provide any function to transform irregular time series to regular ones.
How you handle irregularly-spaced time series depends on the goals you are trying to achieve through your analysis. Use cases for time series include prediction/forecasting, anomaly detection, or trying to understand/analyze past behaviour.
If you wish to use the algorithms available in spark-ts (as opposed to modeling your data through other statistical processes designed for event streams), one option is to divide the time axis into equally sized bins and then compute a summary of your data within each bin (e.g., the total, the mean, etc.). As you make the bins finer, less information is lost to quantizing the time dimension, but the data may become harder to model, so the bin size controls the tradeoff. The binned data then forms an evenly spaced time series, which you can analyze using typical time series techniques.
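The binning idea above can be sketched in a few lines of plain Python. This is only an illustration, not a spark-ts API: the function name `regularize`, the `(timestamp, value)` tuple layout, and the choice to mark empty bins with `None` are all assumptions made for the example. How to fill empty bins (forward fill, interpolation, leaving them missing) is a separate decision that depends on your analysis.

```python
from collections import defaultdict
from statistics import fmean


def regularize(events, bin_size):
    """Bin irregular (timestamp, value) events into equally sized time bins.

    Returns a list of (bin_start, mean_value) pairs covering the whole span
    of the data. Bins that received no events get None (hypothetical
    convention for this sketch; pick a fill policy that suits your model).
    """
    events = sorted(events)
    if not events:
        return []
    t0 = events[0][0]
    t_last = events[-1][0]

    # Collect the values that fall into each bin index.
    buckets = defaultdict(list)
    for t, v in events:
        buckets[int((t - t0) // bin_size)].append(v)

    n_bins = int((t_last - t0) // bin_size) + 1
    return [
        (t0 + i * bin_size, fmean(buckets[i]) if i in buckets else None)
        for i in range(n_bins)
    ]


# Event-based sensor readings at uneven timestamps (made-up numbers).
events = [(0.0, 1.0), (0.4, 3.0), (1.2, 5.0), (3.1, 7.0)]
regular = regularize(events, bin_size=1.0)
print(regular)
```

Swapping `fmean` for `sum`, `max`, or a time-weighted average changes the per-bin summary without changing the structure.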

Related

Min Max Normalization/Normal Distribution

I have a dataset with county-level data where N=3119 with 93 variables. I am trying to do a PCA, EFA and/or CFA. The data has been given to me already min/max normalized, ranging from 0 to 1. Theory states that the data should be normally distributed for CFA/SEM, but my understanding is that min/max normalization does not change the distribution of the data, only its scale.
It is clear to me that I do not have multivariate normality or univariate normality, due to the skewness of the data. I guess what's confusing me is that people seemingly use the term normalization interchangeably with normal distribution.
So can I go forward with my analysis since min/max normalization has been performed, or do I need to look towards log or Box-Cox transformations to adjust the distribution prior to running my analysis? Is it okay to log transform data that has already been min/max normalized?
"my understanding is that min/max normalization does not change the distribution of the data, only its scale."
Correct. If you plot a histogram (e.g., with hist()) of the original and the transformed data, they should look identical. Only the x-axis scale will change.
"the term normalization interchangeably with the meaning of normal distribution"
Indeed, these are completely separate issues.
"Is it okay to log transform data that has already been min/max normalized?"
Taking the log() would affect 0-1 data differently than data further up the real-number line (and min/max-normalized data contain exact zeros, where the log is undefined). But I don't see why you need to transform the data when nonnormality corrections are available for standard errors (in EFA or CFA) and for model-fit test statistics (relevant for CFA). Independent-components analysis might be an alternative to PCA if your data are not normal.
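The claim that min/max normalization changes only the scale, not the shape, can be checked numerically. The sketch below is an illustration on made-up data, with hypothetical helper names `min_max` and `skewness`: it computes the (population) skewness before and after rescaling and shows that the two agree.

```python
from statistics import fmean, pstdev


def min_max(xs):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def skewness(xs):
    """Population skewness: the mean of the cubed standardized values."""
    m, s = fmean(xs), pstdev(xs)
    return fmean([((x - m) / s) ** 3 for x in xs])


# A right-skewed sample (made-up numbers).
data = [1, 2, 2, 3, 3, 3, 10, 25, 60]
scaled = min_max(data)

print(skewness(data), skewness(scaled))  # the two values agree
print(min(scaled), max(scaled))          # 0.0 1.0
```

Because the rescaling is linear with a positive slope, every standardized value is unchanged, so skewness, kurtosis, and the histogram shape all carry over.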

Time series anomaly detection

I would like to get suggestions about a time series problem. The data consists of strain gauge readings on an aircraft wing, measured using different sensors. Basically, we are creating the anomalies by simulating the physics model. We have a baseline that works fine, and we then created some anomalies by changing some of the factors and recorded them over time. Our aim is to create a model that can detect an anomaly during live testing (it could be a crack in the wing): basically, real-time anomaly detection using statistical methods or machine learning.
A few thoughts, sorted roughly from top to bottom by time investment (assuming little/no prior ML knowledge):
Start simple and validate. For what you've described, this could be as simple as:
- Create a training/validation dataset using your simulator. Since you can simulate, do so for significant episodes of both "standard" and extreme forces applied to the wing.
- Choose a real-time smoother (e.g., exponential averaging or a moving average), determine a proper parameter for each of your input sensor signals, and smooth the input signals.
- Determine threshold values, either by:
  - creating rough but sensible lower-bound threshold values "by eye", or
  - using simple statistics to determine a decent threshold value (e.g., using a moving fixed-length window of appropriate size, and setting the threshold at a multiple of the standard deviation in that window as it slides across the entire signal).
  In either case, test on further simulated (and, ideally, also real) data.
If an effort like this works "good enough", stop and move on to the next (facet of the) problem. If not:
- Follow the first two steps (simulate and smooth the data).
- Take an "autoregressive" approach: create training/validation input/output pairs by running a sliding window of fixed length over the input signal(s). Train a simple supervised learner on these pairs, for each input signal or for all together, to produce a (set of) time series anomaly detector(s) trained on your simulated data. Cross-validate with the validation portion of your data.
- Use this model (or one like it) on your validation data to test performance, and ideally collect real (not simulated) data to validate your model even further.
If this sort of approach produces "good enough" results, stop and move on to the next facet of the problem.
If not, examine and try any number of the anomaly detection approaches, coded in a variety of languages, listed on an aggregator like the awesome repo for time series anomaly detection.
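The simplest variant described above (smooth the signal, then threshold at a multiple of a moving-window standard deviation) might look roughly like the following. This is a minimal sketch under stated assumptions: the function names `ewma` and `flag_anomalies`, the smoothing factor, the window length, the multiplier `k`, and the synthetic signal are all illustrative choices, not values recommended for strain gauge data.

```python
from statistics import fmean, pstdev


def ewma(signal, alpha):
    """Exponentially weighted moving average; alpha in (0, 1]."""
    smoothed = []
    state = signal[0]
    for x in signal:
        state = alpha * x + (1 - alpha) * state
        smoothed.append(state)
    return smoothed


def flag_anomalies(signal, window, k):
    """Flag sample i when it lies more than k standard deviations away
    from the mean of the `window` samples immediately before it."""
    flags = [False] * len(signal)
    for i in range(window, len(signal)):
        ref = signal[i - window:i]
        mu, sd = fmean(ref), pstdev(ref)
        if sd > 0 and abs(signal[i] - mu) > k * sd:
            flags[i] = True
    return flags


# Synthetic baseline with a small periodic wobble and one injected jump
# at index 40 (made-up data standing in for a simulated sensor channel).
signal = [1.0 + 0.01 * ((i % 5) - 2) for i in range(60)]
signal[40] += 1.0

smoothed = ewma(signal, alpha=0.5)
flags = flag_anomalies(smoothed, window=20, k=4.0)
print([i for i, f in enumerate(flags) if f])
```

Both functions only look at past samples, so the same logic can run sample by sample on a live stream. In practice, `alpha`, `window`, and `k` would be tuned on the simulated training episodes and checked on the validation episodes.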

Maximum log-likelihood from data histogram not data directly

I have a complicated theoretical probability density function (PDF) that I define in Mathematica and that depends on some parameters I need to estimate by comparison with real data. From a big simulation, run on a cluster rather than on my laptop, I have acquired a lot of events (over 10^9).
The way I understand things, given that I know what the PDF is, I 'just' need to sum the log-probabilities of those events for a given set of parameters and maximise this quantity by adjusting the parameters.
However, given the number of events, I would rather work with something less computationally expensive, for example something easily generated like a histogram of my data. But then how would my log-likelihood estimator work?
Thanks a lot for your answers!

Ideas on filtering out consistent time series data

So I have two subsets of data that represent two situations. The one that looks more consistent needs to be filtered out (it is noise), while the one that looks random is kept (it is motion). The method I was using was to define a moving window of size 10 and, whenever the standard deviation of the data within the window was smaller than some threshold, suppress that data. However, this method could not filter out all of the "consistent" noise, and it also damaged the inconsistent data (real motion). I was hoping to use some kind of statistical model rather than machine learning to accomplish this. Any suggestions would be appreciated!
[Figure: noise]
[Figure: real motion]
The Kolmogorov–Smirnov test is used to compare two samples to determine whether they come from the same distribution. I realized that real-world data would never be uniform. So instead of comparing my noise data against the uniform distribution, I used the scipy.stats.ks_2samp function to compare each burst against one real motion burst. I then muted the burst if the returned p-value was sufficiently small, meaning I could reject the hypothesis that the two samples come from the same distribution.
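A rough sketch of that filtering rule is shown below. `scipy.stats.ks_2samp` is the real SciPy function named above; everything else (the helper name `is_motion`, the 0.05 cutoff, and the synthetic Gaussian bursts) is an assumption made for illustration and would need to be replaced with real bursts and a tuned threshold.

```python
import random

from scipy.stats import ks_2samp


def is_motion(burst, reference_motion, alpha=0.05):
    """Keep a burst unless the two-sample KS test rejects, at level alpha,
    the hypothesis that it shares a distribution with the reference burst."""
    result = ks_2samp(burst, reference_motion)
    return result.pvalue >= alpha


random.seed(0)
# Synthetic stand-ins: a wide-spread "motion" burst and a tight "noise" burst.
reference_motion = [random.gauss(0.0, 1.0) for _ in range(200)]
noise_burst = [random.gauss(5.0, 0.05) for _ in range(200)]

print(is_motion(noise_burst, reference_motion))       # noise burst is muted
print(is_motion(reference_motion, reference_motion))  # identical sample is kept
```

Note that failing to reject is not proof that a burst is motion; with short bursts the test has little power, so the cutoff and the choice of reference burst both matter.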

What descriptive statistics are commonly used for time-series data?

I have a time series of weekly usage data, and I'm going to attempt to use some statistics to segment the population. Skewness and kurtosis may allow me to describe the time series and group the people in different ways. But I also notice that some series appear to have saw-tooth or bimodal patterns, and I don't think these two statistics will describe them well. Distance from the mean would tell me who has continual steady usage vs. unpredictable usage.
What descriptive statistics are commonly used for time-series data?
Thanks,
The periodogram and the autocorrelation function are two common sources of information used to analyse and model time series. You can use this information to compare the series.
In the periodogram you can detect the frequencies at which the estimated spectral density is the highest. This will tell you which series are dominated by cycles of the same frequency.
The autocorrelation function (the time domain counterpart of the periodogram) and the partial autocorrelation function can similarly be used to compare and group the series. Those series with significant autocorrelations at the same lag orders could be grouped together.
You may need to transform the series in order to discern some of this information, for example taking differences to render the data stationary. Alternatively you can select an ARIMA model for each series and compare the characteristics of each model (those characteristics will be pretty much the same as those observed in the autocorrelation functions).
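As a small illustration of grouping by autocorrelation, the sketch below computes the sample autocorrelation function in plain Python and reports the lag with the strongest autocorrelation. The helper names `acf` and `dominant_lag`, and the saw-tooth example series, are assumptions for this example; a statistics package would normally supply the ACF, the partial ACF, and significance bands.

```python
from statistics import fmean


def acf(series, max_lag):
    """Sample autocorrelation at lags 0..max_lag (lag 0 is always 1)."""
    n = len(series)
    m = fmean(series)
    dev = [x - m for x in series]
    c0 = sum(d * d for d in dev)
    return [
        sum(dev[t] * dev[t + k] for t in range(n - k)) / c0
        for k in range(max_lag + 1)
    ]


def dominant_lag(series, max_lag):
    """Lag (>= 1) with the largest autocorrelation; a crude grouping key."""
    r = acf(series, max_lag)
    return max(range(1, max_lag + 1), key=lambda k: r[k])


# A saw-tooth usage pattern that repeats every 4 weeks (made-up data).
sawtooth = [0, 1, 2, 3] * 10
r = acf(sawtooth, max_lag=8)
print([round(v, 3) for v in r])
print(dominant_lag(sawtooth, max_lag=8))  # 4
```

Series that share the same dominant lag (here, 4 weeks) could be placed in the same segment, which captures the saw-tooth behaviour that skewness and kurtosis miss.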

Resources