I have a series of data, where the value of Count keeps increasing.
Here, there are 5 outliers (detected manually by inspection): four occurrences of the value 686 and one value of 14016.
Is there any technique in Python I can use to detect these outliers?
Date        Count  Anomaly?
9/29/2019   11462
10/30/2019  12782
11/28/2019  686    anomaly
2/28/2020   13222
3/30/2020   13305
4/29/2020   13316
5/30/2020   14016  anomaly
6/29/2020   13372
7/30/2020   13487
8/30/2020   13519
9/30/2020   13553
10/30/2020  686    anomaly
11/29/2020  13577
12/26/2020  13580
1/30/2021   13594
2/27/2021   13594
3/30/2021   686    anomaly
4/29/2021   686    anomaly
5/30/2021   13619
6/11/2021   13619
Transform the series of counts into a series of differences between the counts. Then you can easily model these kinds of anomalies. In your example it seems like considering any difference under 0 to be an anomaly would work.
If the expected rate of increase is somewhat constant, or at least reasonably bounded one could also put an upper threshold on the diff to consider it anomalous.
I see that the number of days between your datapoints is not entirely regular. Then one could also normalize the difference to take this into account, by dividing by number of days since last datapoint.
Pandas and NumPy both provide a diff function that performs this transformation.
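As a rough sketch of the idea, the snippet below applies pandas `diff` to the rows from the question and also divides by the number of days between readings. The column names `diff`, `days`, `rate_per_day` and `negative_diff` are made up for this illustration.

```python
import pandas as pd

# Rows copied from the table in the question
rows = [
    ("9/29/2019", 11462), ("10/30/2019", 12782), ("11/28/2019", 686),
    ("2/28/2020", 13222), ("3/30/2020", 13305), ("4/29/2020", 13316),
    ("5/30/2020", 14016), ("6/29/2020", 13372), ("7/30/2020", 13487),
    ("8/30/2020", 13519), ("9/30/2020", 13553), ("10/30/2020", 686),
    ("11/29/2020", 13577), ("12/26/2020", 13580), ("1/30/2021", 13594),
    ("2/27/2021", 13594), ("3/30/2021", 686), ("4/29/2021", 686),
    ("5/30/2021", 13619), ("6/11/2021", 13619),
]
df = pd.DataFrame(rows, columns=["Date", "Count"])
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")

# Difference to the previous reading, and the gap in days between readings
df["diff"] = df["Count"].diff()
df["days"] = df["Date"].diff().dt.days

# Difference normalised by the irregular spacing of the readings
df["rate_per_day"] = df["diff"] / df["days"]

# A count that only ever increases should never have a negative difference
df["negative_diff"] = df["diff"] < 0

print(df[df["negative_diff"]])
```

A negative difference marks the row where the series fell. A one-point upward spike such as 14016 therefore shows up as a drop on the following row, and a repeated low value has a difference of zero, so the flags still need some interpretation (or an extra upper threshold on `rate_per_day`).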
I am trying to identify all peaks in my sensor readings. The smallest peak can be less than 10 in amplitude and the largest can be more than 400. The rolling time window is not fixed, as one peak can arrive after 6 hours and the next one after another 3 hours. I tried a wavelet transform and Python peak identification, but that only works for the higher peaks. How do I resolve this? Here is signal image link: the peaks I want to identify are in grey, and the output of my algorithm is in blue.
Welcome to SO.
It is hard to provide you with a detailed answer without knowing your data's sampling rate and the duration of the peaks. From what I see in your example image they seem all over the place!
I don't think that wavelets will be of any use for your problem.
A recipe that I like to use to despike data is:
Smooth your input data using a median filter (an 11-point median filter generally does the trick for me): smoothed=scipy.signal.medfilt(data, kernel_size=11)
Compute a noise array by subtracting smoothed from data: noise=data-smoothed
Create a despiked_data array from data:
despiked_data=np.zeros_like(data)
np.copyto(despiked_data, data)
Then every time the noise exceeds a user defined threshold (mythreshold), replace the corresponding value in despiked_data with nan values: despiked_data[np.abs(noise)>mythreshold]=np.nan
You may later interpolate the output despiked_data array but if your intent is simply to identify the spikes, you don't even need to run this extra step.
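Put together, the recipe could look roughly like the sketch below. The function name `despike`, the synthetic sine signal and the threshold value are assumptions for illustration only; the input is converted to float so that nan can be stored in it.

```python
import numpy as np
from scipy.signal import medfilt

def despike(data, threshold, kernel_size=11):
    """Return a copy of data with spikes set to nan, plus a boolean spike mask."""
    data = np.asarray(data, dtype=float)
    smoothed = medfilt(data, kernel_size=kernel_size)  # median-filtered baseline
    noise = data - smoothed                            # what the filter removed
    spikes = np.abs(noise) > threshold                 # user-defined threshold
    despiked = data.copy()
    despiked[spikes] = np.nan
    return despiked, spikes

# Synthetic example: a slow sine wave with three artificial spikes
t = np.linspace(0, 10, 500)
signal = np.sin(t)
signal[[100, 250, 400]] += [5.0, -4.0, 6.0]

despiked, spikes = despike(signal, threshold=1.0)
print(np.flatnonzero(spikes))
```

The kernel size should be larger than the widest spike you expect, otherwise the median follows the spike instead of ignoring it.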
I have a list of sensor measurements for air quality with geo-coordinates, and I would like to implement outlier detection. The list of sensors is relatively small (~50).
The air quality can gradually change with the distance, but abrupt local spikes are likely outliers. If one sensor in the group of closely located sensors shows a higher value it could be an outlier. If the same higher value is shown by more distant sensors it might be OK.
Of course, I can ignore coordinates and do simple outlier detection assuming the normal distribution, but I was hoping to do something more sophisticated. What would be a good statistical way to model this and implement outlier detection?
The above statement, ("If one sensor in the group of closely located sensors shows a higher value it could be an outlier. If the same higher value is shown by more distant sensors it might be OK."), would indicate that sensors that are closer to each other tend to have values that are more alike.
Tobler’s first law of geography - “everything is related to everything else, but near things are more related than distant things”
You can quantify an answer to this question. The focus should not be on the locations and values of the outlier sensors. Use global spatial autocorrelation to measure the degree to which sensors that are near each other tend to be more alike.
As a start, you will first need to define neighbors for each sensor.
I'd calculate a cost function, consisting of two costs:
1: cost_neighbors: Calculates the deviation of the sensor value from an expected value. The expected value is a weighted average of the neighbors' values, where closer neighbors get a larger weight (inverse distance).
2: cost_previous_step: Check how much the value of the sensor changed compared to the last time step. Large change in value leads to a large cost.
Here is some pseudo code describing how to calculate the costs:
expected_value = ((value_neighbor_0 / distance_neighbor_0) + (value_neighbor_1 / distance_neighbor_1) + ...) / ((1 / distance_neighbor_0) + (1 / distance_neighbor_1) + ...)
cost_neighbors = abs(expected_value - value)
cost_previous_step = abs(value[t] - value[t-1])
total_cost = a*cost_neighbors + b*cost_previous_step
a and b are parameters that can be tuned to give each of the costs more or less impact. The total cost is then used to determine if a sensor value is an outlier, the larger it is, the likelier it is an outlier.
To figure out the performance and weights, you can plot the costs of some labeled data points, of which you know if they are an outlier or not.
cost_neighbors
| X
| X X
|
|o o
|o o o
|___o_____________ cost_previous_step
X= outlier
o= non-outlier
You can now either set the threshold by hand or create a small dataset with the labels and costs, and apply any sort of classifier function (e.g. SVM).
If you use Python, an easy way to find neighbors and their distances is scipy.spatial.cKDTree
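A minimal sketch of the two costs using `scipy.spatial.cKDTree` might look like this. It assumes the coordinates are already projected to a planar system (so Euclidean distance is meaningful), that no two sensors share the same position, and that the expected value is an inverse-distance weighted mean of the `k` nearest neighbors. The function name `outlier_costs` and the grid of sensors are hypothetical.

```python
import numpy as np
from scipy.spatial import cKDTree

def outlier_costs(coords, values, previous_values, k=4, a=1.0, b=1.0):
    """Total cost per sensor: a * cost_neighbors + b * cost_previous_step."""
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    previous_values = np.asarray(previous_values, dtype=float)

    tree = cKDTree(coords)
    # Ask for k + 1 neighbors because each point is its own nearest neighbor
    dist, idx = tree.query(coords, k=k + 1)
    dist, idx = dist[:, 1:], idx[:, 1:]

    # Inverse-distance weighted mean of the neighboring sensor values
    weights = 1.0 / dist
    expected = (weights * values[idx]).sum(axis=1) / weights.sum(axis=1)

    cost_neighbors = np.abs(expected - values)
    cost_previous_step = np.abs(values - previous_values)
    return a * cost_neighbors + b * cost_previous_step

# Hypothetical 5 x 5 grid of sensors with a smooth spatial gradient
xs, ys = np.meshgrid(np.arange(5.0), np.arange(5.0))
coords = np.column_stack([xs.ravel(), ys.ravel()])
previous = 10.0 + 0.1 * coords[:, 0]

current = previous.copy()
current[12] += 20.0  # one sensor jumps abruptly

total_cost = outlier_costs(coords, current, previous)
print(int(np.argmax(total_cost)))
```

Note that the neighbors of a spiking sensor also receive a somewhat raised `cost_neighbors`, because the spike pulls their expected value away from their own reading. A robust alternative would be to use the median of the neighbors instead of a weighted mean.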
I have a requirement where I have to verify the transmit power out of a device as measured at its connector is within 2 dB of its expected value over 95% of test measurements.
I am using a signal analyzer to analyze the transmitted power. I only get the average power value, min, max and stdDev of the measurements and not the individual power measurements.
Now, the question is how would I verify the "95% thing" using average power, min, max and stdDev. It seems that I can use normal distribution to find the 95% confidence level.
I would appreciate if someone can help me on this.
Thanks in anticipation
The way I'm reading this, it seems you are a statistical beginner, so if I'm wrong there, the rest of this answer will probably be insultingly basic, and I'm sorry.
Anyway, the idea is that if a dataset is normally distributed, and all the observations are independent of one another, then 95% of the data points will fall within 1.96 standard deviations of the mean.
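If you are willing to assume the individual readings are normally distributed, the reported mean and standard deviation are enough to estimate the share of readings that fall within 2 dB of the expected value. The sketch below does that with `scipy.stats.norm`; the function name `fraction_within` and the numbers are made up for illustration.

```python
from scipy.stats import norm

def fraction_within(mean, sd, expected, tol=2.0):
    """Estimated share of readings inside expected +/- tol, assuming normality."""
    upper = norm.cdf(expected + tol, loc=mean, scale=sd)
    lower = norm.cdf(expected - tol, loc=mean, scale=sd)
    return upper - lower

# Hypothetical analyzer summary, all values in dBm
share = fraction_within(mean=20.3, sd=0.8, expected=20.0)
print(f"estimated share within 2 dB: {share:.3f}")
```

The reported min and max give a quick sanity check: if both lie within 2 dB of the expected value, every measurement in that run passed, regardless of the distribution.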
Do you get identical estimates of average power every time you measure, or are there some slight random differences from reading to reading? My guess is that it's the second. If you were to measure the power a whole bunch of times, and each time you plotted your average power value on a histogram, then that histogram of sample means would have the shape of a bell curve. This bell curve of sample means would have its own mean and standard deviation, and if you have thousands or millions of data points going into the calculation of each average power reading, it's not horrible to assume that it is a normal distribution. The explanation for this phenomenon is known as the 'central limit theorem', and I recommend both the Khan academy's presentation of it as well as the wikipedia page on it.
On the other hand, if your average power is the mean of some small number of data points, like for instance n = 5 or n = 30, then the assumption of a normal distribution of sample means can be pretty bad. In this case, your 95% confidence interval around the average power goes from qt(0.975,n-1)*SD/sqrt(n) below the average to qt(0.975,n-1)*SD/sqrt(n) above the average, where qt(0.975,n-1) is the 97.5th percentile of the t distribution with n-1 degrees of freedom, and SD is your measured standard deviation.
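In Python, the counterpart of `qt` is `scipy.stats.t.ppf`. The following sketch computes that interval for the average power itself; the function name `mean_confidence_interval` and the example numbers are assumptions.

```python
import math
from scipy.stats import t

def mean_confidence_interval(mean, sd, n, level=0.95):
    """t-based confidence interval for a mean estimated from n readings."""
    q = t.ppf(0.5 + level / 2, df=n - 1)  # 97.5th percentile for level=0.95
    half_width = q * sd / math.sqrt(n)
    return mean - half_width, mean + half_width

# Hypothetical summary: 30 readings, mean 20.3 dBm, standard deviation 0.8 dB
low, high = mean_confidence_interval(mean=20.3, sd=0.8, n=30)
print(f"95% CI for the average power: {low:.2f} to {high:.2f} dBm")
```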
I've got files with irradiance data measured every minute 24 hours a day.
So if there is a day without any clouds in the sky, the data shows a nice continuous bell curve.
When looking for a day without any clouds in the data, I always plotted month after month with gnuplot and checked for nice bell curves.
I was wondering if there's a Python way to check whether the irradiance measurements form a continuous bell curve.
Don't know if the question is too vague but I'm simply looking for some ideas on that quest :-)
For a normal distribution, there are normality tests.
In short, we abuse some knowledge we have of what normal distributions look like to identify them.
The kurtosis of any normal distribution is 3. Compute the kurtosis of your data and it should be close to 3.
The skewness of a normal distribution is zero, so your data should have a skewness close to zero.
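Both quantities are available in `scipy.stats`. One thing to watch out for: `scipy.stats.kurtosis` returns the excess kurtosis (normal = 0) by default, so `fisher=False` is needed to get the convention where a normal distribution has kurtosis 3. The sample below is synthetic and only stands in for real data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.normal(size=10_000)  # synthetic stand-in for a real sample

skewness = stats.skew(data)
kurt = stats.kurtosis(data, fisher=False)  # fisher=False: normal -> 3

print(f"skewness: {skewness:.3f}, kurtosis: {kurt:.3f}")
```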
More generally, you could compute a reference distribution and use a statistical divergence to assess the difference between the two distributions. Bin your data, create a histogram, and start with the Jensen-Shannon divergence.
With the divergence approach, you can compare to an arbitrary distribution. You might record a thousand sunny days and check if the divergence between the sunny day and your measured day is below some threshold.
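Here is a small sketch of the histogram comparison. It uses `scipy.spatial.distance.jensenshannon`, which returns the Jensen-Shannon distance (the square root of the divergence), so the result is squared. The function name `js_divergence`, the bin count and the synthetic samples are assumptions.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def js_divergence(sample, reference, bins=30):
    """Jensen-Shannon divergence (base 2, between 0 and 1) of two samples."""
    lo = min(sample.min(), reference.min())
    hi = max(sample.max(), reference.max())
    edges = np.linspace(lo, hi, bins + 1)      # shared bins for both histograms
    p, _ = np.histogram(sample, bins=edges)
    q, _ = np.histogram(reference, bins=edges)
    # jensenshannon normalises the counts and returns the distance
    return jensenshannon(p, q, base=2) ** 2

rng = np.random.default_rng(0)
reference = rng.normal(size=5000)              # stand-in for the reference
similar = rng.normal(size=5000)
different = rng.uniform(-3, 3, size=5000)

print(js_divergence(similar, reference), js_divergence(different, reference))
```

The threshold that separates "close enough" from "too different" has to be chosen empirically, for example from the spread of divergences among days you already know are cloudless.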
Just to complement the given answer with a code example: one can use a Kolmogorov-Smirnov test to obtain a measure for the "distance" between two distributions. SciPy offers a neat interface for this, called kstest:
from scipy import stats
import numpy as np
data = np.random.normal(size=100) # Our (synthetic) dataset
D, p = stats.kstest(data, "norm") # Perform a one-sample Kolmogorov-Smirnov test against the standard normal
In the above example, D denotes the distance between our data and a Gaussian normal (norm) distribution (smaller is better), and p denotes the corresponding p-value. Other distributions can be similarly tested by substituting norm with those implemented in scipy.stats.
I have a question regarding how to determine the Duration of notes given their Onset Locations.
So for example, I have an array of amplitude values (containing short) and another array of the same size, that contains a 1 if a note onset is detected, and a 0 if not. So basically, the distance between each 1 will be used to determine the duration.
How can I do this? I know that I have to use the Sample Rate and other attributes of the audio data, but is there a particular formula that I can use?
Thank you!
So you are starting with a list of ONSETS, what you are really looking for is a list of OFFSETS.
There are many methods for onset detection (here is a paper on it) https://adamhess.github.io/Onset_Detection_Nov302011.pdf
Many of the same methods can be applied to offset detection:
Since an onset is marked by an INCREASE in spectral content, an offset can be found by measuring a DECREASE in spectral content.
Take a reasonable time window before and after your onset (0.25-0.5 s).
Chop up the window into smaller segments and take Fourier transforms with 50% overlap.
Compute the spectral difference (SD) between the Fourier coefficients of successive segments, and keep only the negative changes, i.e. the decreases.
Multiply your results by -1.
Pick the peaks off of the results.
Voila, offsets.
(Look at page 7 of the paper listed above for more detail about the spectral difference function; you can apply a version of it modified as described above.)
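The steps above can be sketched roughly as follows. This is a simplified variant, not the exact function from the paper: it sums the plain negative magnitude differences between successive frames, and it processes the whole signal rather than a window around each onset. The function name `offset_strength`, the frame length, the peak height threshold and the synthetic test tone are all assumptions.

```python
import numpy as np
from scipy.signal import find_peaks

def offset_strength(samples, frame=1024):
    """Summed spectral decrease between successive 50%-overlapping frames."""
    hop = frame // 2
    window = np.hanning(frame)
    n_frames = 1 + (len(samples) - frame) // hop
    spectra = np.array([
        np.abs(np.fft.rfft(window * samples[i * hop : i * hop + frame]))
        for i in range(n_frames)
    ])
    change = np.diff(spectra, axis=0)      # frame-to-frame spectral difference
    decrease = np.minimum(change, 0.0)     # keep only the negative changes
    return -decrease.sum(axis=1), hop      # multiply by -1 so offsets are peaks

# Synthetic test signal: a 440 Hz tone that stops after 0.5 s
fs = 8000
frame = 1024
t = np.arange(fs) / fs
samples = np.sin(2 * np.pi * 440 * t) * (t < 0.5)

strength, hop = offset_strength(samples, frame=frame)
peaks, _ = find_peaks(strength, height=0.5 * strength.max())

# Time at the midpoint between the centres of the two frames being compared
offset_times = (peaks * hop + hop / 2 + frame / 2) / fs
print(offset_times)
```

The time resolution is limited by the hop size, so the estimated offset is only accurate to within a frame or so. Real recordings with decaying notes will need a tuned threshold.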
Well, if your sample rate in Hz is fs, then the time between two note onsets is equal to
1/fs * (<number of zeros between the two onset ones> + 1)
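In NumPy this comes down to taking the difference between the indices of the ones. The helper name `note_durations` and the toy onset array are made up for this sketch.

```python
import numpy as np

def note_durations(onsets, fs):
    """Seconds between consecutive onsets in a 0/1 onset array sampled at fs Hz."""
    onset_idx = np.flatnonzero(onsets)  # sample indices where an onset is marked
    return np.diff(onset_idx) / fs      # index spacing converted to seconds

# Toy example: onsets at samples 0, 4 and 6 with a sample rate of 4 Hz
onsets = np.array([1, 0, 0, 0, 1, 0, 1, 0, 0])
durations = note_durations(onsets, fs=4)
print(durations)
```

The last note has no following onset, so its duration cannot be obtained this way; it needs either the end of the recording or a detected offset.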
Very simple :-)
Regards