I have an input source that gives me integers in [0..256].
I want to be able to locate spikes in this data, i.e. points where a new input jumps sharply above the recent values.
I've tried using a rolling average in conjunction with the percent error from it, but this doesn't really work.
Basically, I want my program to find where a graph of the data would spike up, while ignoring smooth transitions.
Thoughts?
A simple thought, following up on my comment. First,
>>> import numpy as np
Suppose we have the following time series
>>> sample = np.random.randint(0, 257, size=100)  # random_integers is deprecated; randint's upper bound is exclusive
To know whether or not a spike can be considered a rare event, we have to know the likelihood associated with each event. Since you are dealing with "rates of change", let us compute those:
>>> sample_vars = np.abs(-1 + 1. * sample[1:] / sample[:-1])  # the "1.*" forces float division on Python 2
We can then define the variation which has at most a 5 percent (in-sample) chance of occurring:
>>> spike_defining_threshold = np.percentile(sample_vars, 95)
Finally, if sample_vars[-1] > spike_defining_threshold, the newest input can be flagged as a spike.
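Putting the pieces together, a minimal sketch of how this could run online (the function name, the 95th percentile default, and drawing the sample from 1..256 to dodge division by zero are all my choices):

import numpy as np

def is_spike(history, new_value, percentile=95):
    history = np.asarray(history, dtype=float)
    # relative changes between consecutive past observations
    past_vars = np.abs(-1 + history[1:] / history[:-1])
    # the new point is a spike if its relative change beats the
    # chosen percentile of what the series has done so far
    threshold = np.percentile(past_vars, percentile)
    return abs(-1 + new_value / history[-1]) > threshold

history = np.random.randint(1, 257, size=100)  # 1..256 to avoid dividing by zero
print(is_spike(history, 300))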
Would be great if others have thoughts to share as well...
I wrote some simple sample code here. In fact, elements will be added to or deleted from the set, and a random element will be chosen from the set on each iteration of my program.
But even when I run the simplified code below, I get different output every time I run it. So, how do I make the output reproducible?
import random
random.seed(0)
x = set()
for i in range(40):
    x.add('a' + str(i))
print(random.sample(x, 1))
The problem is that a set's elements are unordered and will vary between runs, even if the random sample chooses the same position. Using random.sample on a set is in fact deprecated since Python 3.9, and in the future you will need to pass in a sequence instead.
You could do this by converting the set to a sequence in a consistently ordered way, such as
x = sorted(x)
or probably better, just use a type like list in the first place (always producing ['a24'] in your example).
x = ['a' + str(i) for i in range(40)]
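For example, applying the first fix to the original code (a minimal check; the sampled element should now be identical on every run):

import random

random.seed(0)
x = set('a' + str(i) for i in range(40))
# sort into a deterministically ordered sequence before sampling
print(random.sample(sorted(x), 1))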
I'm (desperately) trying to figure out TensorFlow 2.0, without much luck so far, but I think I'm close with what I need right now.
I've followed the doc here to make a simple network to forecast stock data (not weather data), and what I'd like to do now is forecast the future using the latest/most recent section of the validation dataset. I'm hoping someone else has read through it already and can help me here.
The code to predict the future using the validation dataset looks like this:
for x, y in val_data_multi.take(3):
    multi_step_plot(x[0], y[0], multi_step_model.predict(x)[0])
...where, to the best of my knowledge, it takes a random chunk (3 separate times) from val_data_multi (a "RepeatDataset"); in my case each chunk is a 20 row x 9 column section. It then uses the multi_step_plot function to produce a plot of the values the model predicts from that random section of the validation dataset. But what if I don't want to take a random validation section, and instead want to use the bottom of my actual dataset? If I have recent stock data at the bottom of my validation dataset and want to forecast a future that hasn't happened yet, how can I take a 20x9 section from the bottom of that set, rather than having "take" grab a random section to predict with?
As a pseudo-code attempt to explain what I'm trying to do, I tried something like:
for x, y in val_data_multi[-20:].take(1): #.take(3)
...to try and make it take one section 20 rows from the bottom, across all columns. But of course this didn't work, failing with TypeError: 'RepeatDataset' object is not subscriptable.
I hope that makes sense, and if it would help for me to post my code, I can do that, but I'm just using what's already shown on that page, with some modifications to use a stock dataset, that's all.
I was able to find a much better guide from this Github repo:
https://github.com/Hvass-Labs/TensorFlow-Tutorials/blob/master/23_Time-Series-Prediction.ipynb
...which basically gets into better detail about what I'm looking to do and made it very easy to understand. Thanks!
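For anyone who lands here wanting a direct workaround for the "not subscriptable" error: one option is to bypass the tf.data pipeline for this one prediction and slice the last window out of the underlying array instead. A rough sketch, assuming x_val is the scaled NumPy feature array the validation dataset was built from (that name is mine, not the tutorial's) and a 20-step history with 9 features:

import numpy as np

# x_val: your scaled validation features, shape (num_rows, 9);
# random stand-in data here just for the sketch
x_val = np.random.rand(500, 9).astype('float32')

past_history = 20
last_window = x_val[-past_history:]                # the most recent 20 rows
last_window = np.expand_dims(last_window, axis=0)  # add batch dim -> (1, 20, 9)

# multi_step_model is the trained model from the tutorial
prediction = multi_step_model.predict(last_window)[0]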
I'm trying to fit data using standard defined functions (Lorentzian & Gaussian) from the lmfit package. The program works quite well for some data sets, but for another one it's not able to fit because the initial values don't seem right. Is there any algorithm which can extract the initial values from the data set and iterate to find the best fit?
I tried some common methods like a brute-force algorithm, but the results are not satisfactory and they cost a lot of time.
It is always recommended to provide a small, complete example script that shows the problem you are having. How could we know why it works in some cases and not in others?
lmfit.GaussianModel and lmfit.LorentzianModel both have guess methods. These should work reasonably well for data with an isolated peak, used like:
import lmfit
model = lmfit.models.GaussianModel()
params = model.guess(ydata, x=xdata)
for p in params.values():
    print(p)
result = model.fit(ydata, params, x=xdata)
print(result.fit_report())
If the data doesn't have a clear isolated peak, that might not work so well.
If finding the peak(s) is the actual problem, try scipy.signal.find_peaks (https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.find_peaks.html) or peakutils (https://peakutils.readthedocs.io/en/latest/). Either of these should give you a good estimate of the center parameter, which is probably the one most likely to cause bad fits if a poor initial value is given.
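As a rough sketch of that idea (the synthetic data, the prominence value, and the crude amplitude seed are placeholders for your own):

import numpy as np
from scipy.signal import find_peaks
import lmfit

# synthetic data standing in for the real measurement
xdata = np.linspace(-10, 10, 201)
ydata = 3.0 * np.exp(-(xdata - 1.5)**2 / (2 * 0.8**2))
ydata = ydata + np.random.normal(0, 0.05, xdata.size)

# locate candidate peaks and use the tallest one to seed `center`
peaks, _ = find_peaks(ydata, prominence=0.5)
center_guess = xdata[peaks[np.argmax(ydata[peaks])]]

model = lmfit.models.GaussianModel()
# ydata.max() is only a rough seed (lmfit's amplitude is the peak area)
params = model.make_params(amplitude=ydata.max(), center=center_guess, sigma=1.0)
result = model.fit(ydata, params, x=xdata)
print(result.fit_report())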
The following links were investigated but didn't provide the answer I was looking for or fix my problem: First, Second.
Due to confidentiality issues I cannot post the actual decomposition, but I can show my current code and give the length of the data set; if this isn't enough I will remove the question.
import numpy as np
from statsmodels.tsa import seasonal

def stl_decomposition(data):
    data = np.array(data)
    # flatten the nested list structure into a single series
    data = [item for sublist in data for item in sublist]
    decomposed = seasonal.seasonal_decompose(x=data, freq=12)
    seas = decomposed.seasonal
    trend = decomposed.trend
    res = decomposed.resid
    return seas, trend, res
In a plot it shows that the decomposition is correct according to an additive model. However, the trend and residual lists have NaN values for the first and last 6 months. The current data set is of size 10*12. Ideally this should work for something as small as only 2 years.
Is this still too small, as said in the first link? I.e., do I need to extrapolate the extra points myself?
EDIT: It seems that half of the frequency is always NaN on both ends of the trend and residual. The same still holds when the size of the data set is decreased.
According to this GitHub link, another user had a similar question. They 'fixed' this issue: to avoid NaNs, an extra parameter can be passed.
decomposed = seasonal.seasonal_decompose(x=data, freq=12, extrapolate_trend='freq')
It will then use linear least squares to best approximate the values. (Source)
Obviously the information was literally in their documentation and clearly explained, but I completely missed/misinterpreted it. Hence I am answering my own question for anyone who has the same issue, to save them the adventure I had.
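A minimal end-to-end check (with made-up monthly data, just to show that the NaNs disappear):

import numpy as np
from statsmodels.tsa import seasonal

# two years of synthetic monthly data: trend + seasonality + noise
t = np.arange(24)
data = 10 + 0.5 * t + 3 * np.sin(2 * np.pi * t / 12) + np.random.normal(0, 0.3, 24)

decomposed = seasonal.seasonal_decompose(x=data, freq=12, extrapolate_trend='freq')
print(np.isnan(decomposed.trend).any())  # False: no NaNs at the ends anymore
print(np.isnan(decomposed.resid).any())  # False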
According to the parameter definition below, setting extrapolate_trend to anything other than 0 makes the trend estimation revert to a different estimation method. I faced this issue when I had only a few observations for estimation.
extrapolate_trend : int or 'freq', optional
If set to > 0, the trend resulting from the convolution is
linear least-squares extrapolated on both ends (or the single one
if two_sided is False) considering this many (+1) closest points.
If set to 'freq', use `freq` closest points. Setting this parameter
results in no NaN values in trend or resid components.
I am using a rolling mean on my data to smooth it. My data can be found here.
An illustration of my original data is shown below.
Currently, I am using
import pandas as pd
import numpy as np
data = pd.read_excel('data.xlsx')
data = np.array(data, dtype=float)  # np.float is deprecated in newer NumPy
window_length = 9
res = pd.rolling_mean(np.array(data[:, 2]), window_length, min_periods=1, center=True)
This is what I get after applying the rolling mean with a window_length of 9.
When I increase the window_length to 20, I get a smoother curve, but at the boundaries the data seems erroneous.
The problem is, as seen in the figures above, that the rolling mean introduces errors at the boundaries of my data which do not exist in the original data.
Is there any way to correct this?
My guess is that at the boundary, since part of the window falls outside my data, the mean is distorted.
Is there a way to correct this error using the pandas rolling mean, or is there a better pythonic way of doing this? Thanks.
Ps. I am aware that the pandas rolling-mean function I am using is deprecated in the new version.
You can try a native 2D convolution method such as scipy.ndimage.filters.convolve, with the weights chosen so that the kernel computes an average (mean).
The weights would be:
n = 3  # size of the kernel over which to calculate the mean
weights = np.ones((n, n)) / n**2
If the white areas of your data are represented by NaNs, this would reduce the footprint of the result by n, since any kernel stamp that includes a NaN will return NaN. If this is really an issue, have a look at astropy.convolution, which has better NaN handling.
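For instance, a minimal sketch of the astropy route (the array and kernel size here are made up; astropy's convolve interpolates over NaNs rather than propagating them):

import numpy as np
from astropy.convolution import convolve, Box2DKernel

# hypothetical 2D data with NaNs marking the missing (white) regions
data = np.random.rand(50, 50)
data[10:15, 20:25] = np.nan

kernel = Box2DKernel(9)  # 9x9 boxcar, i.e. a mean filter
# boundary='extend' repeats edge values, reducing boundary artefacts
smoothed = convolve(data, kernel, boundary='extend')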