What is the logic behind the output of statistics.quantiles method - statistics

I have just started learning stats , and came across on this quantiles concept, want to know the reason behind the output, if- a =[1,2,3,3] statistics.quantiles(a) it returns [1.25,2.5, 3.0] , what is the reason behind 1.25 value? When method is 'inclusive' it gives 1.75.

Related

gnuplot fit undefined value during function evaluation

I have the datafile:
10.0000 -330.12684910
15.0000 -332.85109334
20.0000 -333.85785274
25.0000 -334.18315783
30.0000 -334.28078907
35.0000 -334.30486903
40.0000 -334.30824069
45.0000 -334.30847874
50.0000 -334.30940105
55.0000 -334.31091085
60.0000 -334.31217217
The commands a used to fit this
f(x) = a+b*exp(c*x)
fit f(x) datafile via a, b, c
didn't get the negative exponential that I expected, then just to see how the hyperbola fitted I tried
f(x) = a+b/x
fit f(x) datafile via a, b
but decided to do this:
f(x) = a+b*exp(-c*x)
fit f(x) datafile via a, b, c
and it worked. I continued doing fits but in some point it started to mark this error undefined value during function evaluation.
I restarted the session and deleted the fit.log file, I thought it was a gnuplot bug but since then I always receive the undefined value error. I've been reading similar issues. It could be a, b, c seeds but I have introduced very similar values to the ones a received that time it fitted well but didn't work. I'm thinking the problem might be chaotic or I'm doing something wrong.
Thank you for your method. I understand gnuplot uses non linear least squares method for fitting.
I found one solution is to use the model y=a+b*exp(-c*c*x), I also found better initial values and it worked. anyway I have this other dataset:
2 -878.11598213
6 -878.08846509
10 -878.08105262
19 -878.07882425
28 -878.07793702
44 -878.07755010
60 -878.07738151
85 -878.07729504
110 -878.07725107
And gnuplot fit does the work but really bad. instead I used your method, here I show a comparison:
gnuplot fit and jjaquelin comparison
it is way better.
I don't know the algorithm used by gnuplot. Probably an iterative method starting from guessed values of the parameters. The difficulty might come from not convenient initial values and/or from no convergence of the process.
From my own calculus the result is close to the values below. The method, which is not iterative and doesn't require initial guess, is explained in the paper : https://fr.scribd.com/doc/14674814/Regressions-et-equations-integrales
FOR INFORMATION :
The linearisation of the regression is obtained thanks to the integral equation
to which the function to be fitted is solution.
The paper referenced above is mainly written in French. It is pratially translated in : https://scikit-guess.readthedocs.io/en/latest/appendices/references.html

Linear interpolation in PromQL or MetricsQL

I am evaluating VictoriaMetrics for an IoT application where we sometimes have gaps in a series due to hardware or communication issues. In some time series reporting situations it is helpful for us to interpolate values for the missing time intervals. I see that MetricsQL (which extends PromQL) has a keep_last_value() function that will fill gaps by holding the last observed value until a new one appears (which will be helpful to us) but in some situations a linear interpolation between the values before and after the gap is a more realistic estimate for the missing portion. Is there a function in PromQL or MetricsQL that will do linear interpolation of missing data in a series, or is it possible to construct a more complex query that will achieve this?
Clarifying the desired interpolation
What I would like is a simple interpolation between the points immediately before and after the gap; this is, I believe, what TimescaleDB's interpolate() function does. In other words, if my time series is:
(1:00, 2)
(2:00, 4)
(3:00, NaN)
(4:00, 5)
(5:00, 1)
I would like the interpolated 3:00 value to be 4.5, half way between the points immediately before and after. I don't want it to be 6 (which is what I would get by extrapolating from the points before the missing one, ignoring the points after) and I don't want whatever value I would get if I did linear regression on the whole series and interpolated at 3:00 (presumably 3, or something close to it).
Of course, this is a simple illustration and it's also possible that the gap could last more than one time step. But in that case I would still like the interpolation to be based solely off of the points immediately before and immediately after the gap, ignoring the rest of the series.
Final answer
Use the interpolate function, now available in VictoriaMetrics starting from v1.38.0.
Original suggestion
This does not achieve the exact interpolation requested in the revised question, but may be useful for others with slightly different requirements
Try combining predict_linear function with default operator from MetricsQL in the following way:
metric default predict_linear(metric[1h], 0)
Try modifying the value in square brackets in order to get the desired level of interpolation.

How can I get the initial values from my dataset for a combined lorentzian and gaussian fit?

I try to fit data using standard defined functions (Lorentzian & Gaussian) from lmfit package. The program works quite well for some data set but for another one its not able to fit because the initial values doesnt seem right. Is there any algorithm which can extract the initial values from the data set and do some iterations in order to find the best fit?
I tried some common methods like bruethe-force algorithm but the results are not satisfactory and it cost a lot of time.
It is always recommended to provide a small, complete example script that shows the problem you are having. How could we know why it works in some cases and not in others?
lmfit.GaussianModel and lmfit.LorentzianModel both have guess methods. This should work reasonably well data with an isolated peak, working like
import lmfit
model = lmfit.models.GaussianModel()
params = model.guess(ydata, x=xdata)
for p in params.values():
print(p)
result = model.fit(ydata, params, x=xdata)
print(result.fit_report())
If the data doesn't have a clear isolated peak, that might not work so well.
If finding the peak(s) is the actual problem, try scipy.signal.find_peaks
(https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.find_peaks.html)or peakutils (https://peakutils.readthedocs.io/en/latest/). Either of these should give you a good estimate of the center parameter, which is probably the most likely to cause bad fits if a poor initial value is give.

Impulse response analysis

I ran an impulse response analysis on a value weighted stock index and a few variables in python and got the following results:
I am not sure how to interpret these results.
Can anyone please help me out?
You might want to check the book "New introduction to Multiple Time Series Analysis" by Helmut Lutkepohl, 2005, for a slightly dense theory about the method.
In the meantime, a simple way you can interpret your plots is, let's say your variables are VW, SP500, oil, uts, prod, cpi, n3 and usd. They all are parts of the same system; what the impulse response analysis does is, try to assess how much one variable impacts another one independently of the other variables. Therefore, it is a pairwise shock from one variable to another. Your first plot is VW -> VW, this is pretty much an autocorrelation plot. Now, look at the other plots: apparently, SP500 exerts a maximum impact on VW (you can see a peak in the blue line reaching 0.25. The y-axis is given in standard deviations and x-axis in lag-periods. So in your example, SP500 cause a 0.25 change in VW at the lag of whatever is in your x-axis (I can't see from your figure). Similarly, you can see n3 negatively impacting VW at a given period.
There is an interesting link that you probably know and shows an example of the application of Python statsmodels VAR for Impulse Response analysis
I used this method to assess how one variable impact another in a plant-water-atmosphere system, there are some explanations there and also the interpretation of similar plots, take a look:
Use of remote sensing indicators to assess effects of drought and human-induced land degradation on ecosystem health in Northeastern Brazil
Good luck!

Google Natural Language Sentiment Analysis Aggregate Scores

In this part of the documentation of the Google Cloud Platform Natural Language API, it is described that
The overall score and magnitude values for an entity are an aggregate of the specific score and magnitude values for each mention of the entity.
I can't figure out how this aggregation works. In the example provided in the documentation, Marvin Gaye has two mentions. One of the mentions has a sentiment of 0.4 and a magnitude of 0.4, the other mention has a score of -0.2 and a magnitude 0.2. The aggregate sentiment for Marvin Gaye is score 0.1 and magnitude 0.6.
I have tried other texts myself and can't figure out how the aggregation is made. Does anyone know?
I think it depends on the length of the document, and how are you using some key words, i ran some tests and the results were all different except for a couple when i used a name of a famous person and didn't use any expression showing a emotion, because always got 0.
I can say that it's not an a sum of values, could be some weird operation using the values that is showed in the response.
About the Marvin Gaye example, the result is a mixed sentiment, because of the use of the emotions: "is the best" and "so sad".
Hope this helps with your research.
I contacted Google Cloud Platform Support and got this answer:
"The way the aggregation works is breaking down the input text into smaller components, often ngrams, which is likely why the documentation talks about aggregation, however, the aggregation isn't a simple addition, one can't sum individual sentiment values of each entity to get a total score."
So it doesn't seem possible to give a simple explanation of exactly how the aggregation is made.

Resources