Working Out Big O Of Functions Using Java + Excel

I have been trying to get the big O of four different functions using Java and Excel. I have no idea what these functions are, as they have been hidden. I am not sure if this is the right place / forum to ask.
I got the functions to produce various pieces of data using some Java and put them into Excel along with the steps (1-n). If the output was always the same (for example, n = 1 always gives 200 every time it is run), I graphed it straight away using just n and the arbitrary measure of time taken. For the functions whose output varied each time they were run, I ran the function 10 times and took an average for each step.
After I had the data I created a graph for each one and put a trendline on it. My f(1), for example, was best fitted by a polynomial trendline of order 2, which I assume means quadratic, O(n^2)? But I needed to prove it was n^2, so I did =Steps/LOG(N), which made it fit best to a polynomial trendline of order 3, which I assume is cubic, O(n^3)? (Is that right?)
I really have no idea what to do next to 'prove' that this function is quadratic or cubic, or how to prove its best case / worst case.
So basically I am trying to work out what the missing step is:
1. Computation
2. Graph
3. Trendline
4. ??? - Proof that the function has big O(?)

When you say "if n=1 always equal 200", does that mean that if n=1 it takes 200 steps to run? If that's the case, this function would be 200n and thus O(n).
I think to solve this you should call each function on different values (I'd start with 10, 20, 30, etc.) up to some high number. Capture these values and plot them in Excel. Then use the built-in trendline function. This should give you a rough estimate of what the run time is. From there you should be able to get the Big-O.
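One way to cross-check the trendline is to fit a straight line to log(steps) against log(n): for data that grows like n^k, the slope comes out close to k. The sketch below is only an illustration in Python, not the asker's Java or Excel setup; the function name loglog_slope and the sample measurements are made up for the example. A fitted slope suggests a growth order, it does not prove it (n log n, for instance, shows up as a slope slightly above 1).

```python
import math

def loglog_slope(ns, steps):
    """Least-squares slope of log(steps) against log(n)."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(s) for s in steps]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

# Hypothetical measurements standing in for a hidden function's step counts.
ns = [10, 20, 40, 80, 160, 320]
steps = [3 * n * n + 5 * n for n in ns]

slope = loglog_slope(ns, steps)
print(round(slope, 2))  # close to 2 suggests quadratic growth
```

Doubling n at each measurement, as done here, spreads the points evenly on the log scale, which tends to make the fitted slope more stable than evenly spaced values of n.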


Linear interpolation in PromQL or MetricsQL

I am evaluating VictoriaMetrics for an IoT application where we sometimes have gaps in a series due to hardware or communication issues. In some time series reporting situations it is helpful for us to interpolate values for the missing time intervals. I see that MetricsQL (which extends PromQL) has a keep_last_value() function that will fill gaps by holding the last observed value until a new one appears (which will be helpful to us) but in some situations a linear interpolation between the values before and after the gap is a more realistic estimate for the missing portion. Is there a function in PromQL or MetricsQL that will do linear interpolation of missing data in a series, or is it possible to construct a more complex query that will achieve this?
Clarifying the desired interpolation
What I would like is a simple interpolation between the points immediately before and after the gap; this is, I believe, what TimescaleDB's interpolate() function does. In other words, if my time series is:
(1:00, 2)
(2:00, 4)
(3:00, NaN)
(4:00, 5)
(5:00, 1)
I would like the interpolated 3:00 value to be 4.5, half way between the points immediately before and after. I don't want it to be 6 (which is what I would get by extrapolating from the points before the missing one, ignoring the points after) and I don't want whatever value I would get if I did linear regression on the whole series and interpolated at 3:00 (presumably 3, or something close to it).
Of course, this is a simple illustration and it's also possible that the gap could last more than one time step. But in that case I would still like the interpolation to be based solely off of the points immediately before and immediately after the gap, ignoring the rest of the series.
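To make the requested arithmetic concrete, here is a minimal Python sketch of that kind of gap filling. It is not a PromQL or MetricsQL query; the function name interpolate_gaps is hypothetical, and it assumes evenly spaced samples with NaN marking the missing points.

```python
import math

def interpolate_gaps(values):
    """Fill runs of NaN by linear interpolation between the nearest
    non-NaN neighbours. Leading and trailing gaps are left as NaN."""
    result = list(values)
    n = len(result)
    i = 0
    while i < n:
        if not math.isnan(result[i]):
            i += 1
            continue
        start = i  # first index of the gap
        while i < n and math.isnan(result[i]):
            i += 1
        end = i  # first index after the gap
        if start == 0 or end == n:
            continue  # no neighbour on one side, nothing to interpolate
        left, right = result[start - 1], result[end]
        span = end - (start - 1)
        for k in range(start, end):
            result[k] = left + (right - left) * (k - (start - 1)) / span
    return result

series = [2, 4, float("nan"), 5, 1]  # values at 1:00 .. 5:00
print(interpolate_gaps(series))      # the 3:00 value becomes 4.5
```

Only the two points bordering each gap are used, so a gap of several time steps is filled along the straight line between them and the rest of the series has no influence.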
Final answer
Use the interpolate function, now available in VictoriaMetrics starting from v1.38.0.
Original suggestion
This does not achieve the exact interpolation requested in the revised question, but may be useful for others with slightly different requirements.
Try combining the predict_linear function with the default operator from MetricsQL in the following way:
metric default predict_linear(metric[1h], 0)
Try modifying the value in square brackets in order to get the desired level of interpolation.

Normalizing audio waveforms code implementation (Peak, RMS)

I have some audio data (array of floats) which I use to plot a simple waveform.
When plotted, the waveform doesn't max out at the edges.
No problem - the data just needs to be normalized. I iterate once to find the max, and then iterate again dividing each by the max. Plot again and everything looks great!
But wait: videos that have a loud intro or a loud explosion cause the rest of the waveform to still be tiny.
After some research, I come across RMS that is supposed to address this. I iterate through the samples and calculate the RMS, and again divide each sample by the RMS value. This results in considerable "clipping":
What is the best method to solve this?
Intuitively, it seems I might need to calculate a local max or average based on a moving window (rather than the entire set) but I'm not entirely sure. Help?
Note: The waveform is purely for visual purposes (the audio will not be played back to the user).
You could transform it (effectively making the y-axis non-linear; you can think of it as a form of companding).
This assumes the signal is within the range [-1, 1].
One popular quick and simple solution is to apply the hyperbolic tangent function (tanh). This limits values to [-1, 1] by compressing higher values more. If you amplify the signal before applying tanh, the effect will be more pronounced.
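A minimal Python sketch of the tanh idea follows. The function name tanh_compress, the gain value of 3.0 and the sample values are all assumptions chosen for illustration; the gain is the "amplify before tanh" step and would need tuning by eye for a real waveform.

```python
import math

def tanh_compress(samples, gain=3.0):
    """Scale each sample by `gain`, then squash it into (-1, 1) with tanh."""
    return [math.tanh(gain * s) for s in samples]

# Hypothetical peak-normalised samples: one loud spike, otherwise quiet.
samples = [0.02, -0.05, 1.0, 0.1, -0.08]
compressed = tanh_compress(samples)
print([round(c, 3) for c in compressed])
```

Quiet samples are boosted roughly by the gain factor while the loud spike stays just below 1, so the quiet parts become visible without any value leaving the plot range.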
Another alternative is a logarithmic transform. As the signal changes sign some pre-processing has to be performed.
If r is a series of sample values one approach could be something like this:
r.log1p <- log2(1.1 * (abs(r) + 1)) * sign(r)
That is, for every value take its absolute value, add one, multiply by some small constant, take the log, and finally multiply the result by the sign of the original value.
The effect can be something like this:

Integrating Power pdf to get energy pdf?

I'm trying to work out how to solve what seems like a simple problem, but I can't convince myself of the correct method.
I have time-series data that represents the pdf of a Power output (P), varying over time, also the cdf and quantile functions - f(P,t), F(P,t) and q(p,t). I need to find the pdf, cdf and quantile function for the Energy in a given time interval [t1,t2] from this data - say e(), E(), and qe().
Clearly energy is the integral of the power over [t1,t2], but how do I best calculate e, E and qe ?
My best guess is that since q(p,t) is a power, I should generate qe by integrating q over the time interval, and then calculate the other distributions from that.
Is it as simple as that, or do I need to get to grips with stochastic calculus ?
Additional details for clarification
The data we're getting is a time-series of 'black-box' forecasts for f(P), F(P), q(P) for each time t, where P is the instantaneous power, and there will be around 100 forecasts for the interval I'd like to get the e(P) for. By 'black-box' I mean that there will be a function I can call to evaluate f, F, q for P, but I don't know the underlying distribution.
The black-box functions are almost certainly interpolating output data from the model that produces the power forecasts, but we don't have access to that. I would guess that it won't be anything straightforward, since it comes from a chain of non-linear transformations. It's actually wind farm production forecasts: the wind speeds may be normally distributed, but multiple terrain and turbine transformations will change that.
Further clarification
(I've edited the original text to remove confusing variable names in the energy distribution functions.)
The forecasts will be provided as follows:
The interval [t1,t2] that we need e, E and qe for is sub-divided into 100 (say) sub-intervals k=1...100. For each k we are given a distinct f(P), call them f_k(P). We need to calculate the energy distributions for the interval from this set of f_k(P).
Thanks for the clarification. From what I can tell, you don't have enough information to solve this problem properly. Specifically, you need to have some estimate of the dependence of power from one time step to the next. The longer the time step, the less the dependence; if the steps are long enough, power might be approximately independent from one step to the next, which would be good news because that would simplify the analysis quite a bit. So, how long are the time steps? An hour? A minute? A day?
If the time steps are long enough to be independent, the distribution of energy is the distribution of a sum of 100 independent variables, which will be very nearly normally distributed by the central limit theorem. It's easy to work out the mean and variance of the total energy in this case.
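Under that independence assumption, and assuming equal-length sub-intervals of duration dt, the total energy is the sum of P_k * dt, so its mean is dt times the sum of the per-step means and its variance is dt^2 times the sum of the per-step variances. The Python sketch below shows that calculation; the function name energy_normal_approx and the numbers are hypothetical, and the per-step means and variances are assumed to have been computed already from each f_k (for example by numerical integration).

```python
from statistics import NormalDist

def energy_normal_approx(means, variances, dt):
    """Normal approximation to total energy, assuming independent time steps.

    means, variances: per-step mean and variance of power (one per f_k).
    dt: length of each sub-interval, in the time unit of the energy.
    """
    mean_e = dt * sum(means)
    var_e = dt ** 2 * sum(variances)
    return NormalDist(mu=mean_e, sigma=var_e ** 0.5)

# Hypothetical numbers: 100 steps of 0.1 h, mean power 10 MW, std dev 2 MW.
energy = energy_normal_approx([10.0] * 100, [4.0] * 100, dt=0.1)
print(energy.mean, energy.stdev)  # mean and std dev of energy in MWh
print(energy.cdf(102.0))          # approximate E() at 102 MWh
print(energy.inv_cdf(0.95))       # approximate qe() at p = 0.95
```

The returned NormalDist object gives approximate versions of e, E and qe through its pdf, cdf and inv_cdf methods, but only to the extent that the independence assumption holds.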
Otherwise, the distribution will be some more complicated result. My guess is that the variance as estimated by the independent-steps approach will be too small: if power is positively correlated from one step to the next, as seems likely for wind, the covariance terms add to the variance of the sum.
From what you say, you don't have any information about temporal dependence. Maybe you can find or derive from some other source or sources an estimate of the autocorrelation function -- I wouldn't be surprised if that question has already been studied for wind power. I also wouldn't be surprised if a general version of this problem has already been studied -- perhaps you can search for something like "distribution of a sum of autocorrelated variables." You might get some interest in that question on stats.stackexchange.com.

Numerical Integration

Generally speaking, when you are numerically evaluating an integral, say in MATLAB, do I just pick a large number for the bounds, or is there a way to tell MATLAB to "take the limit"?
I am assuming that you just use the large number because different machines would be able to handle numbers of different magnitudes.
I am just wondering if there is a way to improve my code. I am doing lots of expected value calculations via Monte Carlo and often use the trapezoid method to check myself if my degrees of freedom are small enough.
Strictly speaking, it's impossible to evaluate a numerical integral out to infinity. In most cases, if the integral in question is finite, you can simply integrate over a reasonably large range. For the normal density, for example, integrating out to about 10 sigma is more than enough for the result to settle at a stable value -- this value is, for better or worse, as close as you are going to get to evaluating the same integral all the way out to infinity.
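The effect of truncating the range can be checked directly. The sketch below is in Python rather than MATLAB and uses a hand-written composite trapezoid rule; the helper names and the choice of 4000 sub-intervals are assumptions for the illustration. It integrates the standard normal density over progressively wider symmetric ranges.

```python
import math

def normal_pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def trapezoid(f, a, b, n):
    """Composite trapezoid rule with n equal sub-intervals."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return total * h

# Widen the integration range and watch the result stop changing.
for limit in (3, 5, 10, 20):
    print(limit, trapezoid(normal_pdf, -limit, limit, 4000))
```

Once the printed value stops changing between two successive limits, widening the range further adds nothing that the floating-point result can represent.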
It depends very much on what type of function you want to integrate. If it is "smooth" (no jumps, preferably not in any derivatives either, though that becomes progressively less important for higher derivatives) and finite, then you have two main choices (limiting myself to the simplest approaches):
1. If it is periodic, meaning here that you could join the left and right ends together without a jump in value (or in the derivatives): distribute your points evenly over the interval, sample the function values to get the estimated average, and then multiply by the length of the interval to get your integral.
2. If it is not periodic: use Gauss-Legendre integration.
Monte Carlo is almost invariably a poor method: it progresses very slowly towards (machine) precision, because for every additional significant digit you need 100 times more points!
The two methods above, for periodic and non-periodic "nice" (smooth, etc.) functions, give fair results already with a very small number of sample points and then progress very rapidly towards more precision: 1 or 2 more points usually add several digits to your precision! This far outweighs the burden of having to throw away the previous result when you want to try again with more sample points: you REPLACE the previous set of points with a fresh one, whereas in Monte Carlo you can simply add points to the existing set and so refine the outcome.
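As a rough illustration of the difference in convergence, the Python sketch below compares the Legendre-based rule mentioned above (Gauss-Legendre quadrature, here with just 3 points) against plain Monte Carlo with 10,000 points on a smooth, non-periodic integrand. The test integrand exp(x) on [0, 1], the function names and the fixed seed are choices made for this example only.

```python
import math
import random

# 3-point Gauss-Legendre nodes and weights on [-1, 1].
GL3_NODES = (-math.sqrt(3.0 / 5.0), 0.0, math.sqrt(3.0 / 5.0))
GL3_WEIGHTS = (5.0 / 9.0, 8.0 / 9.0, 5.0 / 9.0)

def gauss_legendre_3(f, a, b):
    """Integrate f over [a, b] with the 3-point Gauss-Legendre rule."""
    half = 0.5 * (b - a)
    mid = 0.5 * (a + b)
    return half * sum(w * f(mid + half * x)
                      for x, w in zip(GL3_NODES, GL3_WEIGHTS))

def monte_carlo(f, a, b, n, seed=0):
    """Plain Monte Carlo estimate using n uniform samples on [a, b]."""
    rng = random.Random(seed)
    return (b - a) * sum(f(rng.uniform(a, b)) for _ in range(n)) / n

exact = math.e - 1.0  # integral of exp(x) over [0, 1]
gl_error = abs(gauss_legendre_3(math.exp, 0.0, 1.0) - exact)
mc_error = abs(monte_carlo(math.exp, 0.0, 1.0, 10_000) - exact)
print(gl_error, mc_error)
```

The 3-point rule is exact for polynomials up to degree 5, which is why so few function evaluations go such a long way on smooth integrands.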

Randomly select increasing subset of data to see where mean levels off

Could anyone please advise the best way to do the following?
I have three variables (X, Y & Z) and four groups (1, 2, 3 & 4). I have been using discriminant function analysis in SPSS to predict group membership of known grouped data for use with future ungrouped data.
Ideally I would like to able to randomly sample an increasing number of a subset of the data to see how many observations are required to hit a desired correct classification percentage.
However, I understand this might be difficult. Therefore, I'm looking to do this for the means.
For example, let's say variable X has a mean of 141 for group 1. This mean might have been calculated from 2000 observations. However, it might be the case that the mean had already settled at, say, 700 observations. I would like to be able to calculate at what number of observations/cases the mean levels off in my data. For example, perhaps starting at 10 observations and repeating this randomly, say, 50 or 100 times, then increasing to 20 observations, and so on.
I understand this is a form of Monte Carlo testing. I have access to SPSS 15, 17 and 18 and Excel. I also have access to Minitab 15 & 16 and AMOS 17 and have downloaded R, but I'm not familiar with these. My experience is with SPSS and Excel. I have tried some syntax in SPSS modified from this: http://pages.infinit.net/rlevesqu/Syntax/RandomSampling/Select2CasesFromEachGroup.txt but this would still be quite time consuming on my part to enter the subset number etc.
Hope some one can help.
Thanks for reading.
Andy
The text you linked to is a good start (you can also use the SAMPLE command in SPSS, but IMO the Raynald script you linked to is more flexible when you think about constructing the sample that way).
In pseudo-code, the process might look like;
do n for sample size (a to b)
    loop 100 times
        draw sample size n
        compute (& save) statistics
Here is where SPSS's macro language comes into play (I think this document is a good introduction, plus you can examine other references on the SPSS tag wiki). Basically once you figure out how to draw the sample and compute the stats you want, you just need to figure out how to write a macro so you can loop through the process (and pass it the sample size parameter). I include the loop 100 times because you want to be able to make some type of estimate about the error associated with each sample size.
If you give an example of how you compute the statistics I may be able to give examples of how to make that into a macro function and loop through the desired number of times.
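For readers more comfortable outside SPSS, the pseudo-code above can be sketched in Python as follows. This is only an illustration of the loop structure, not an SPSS macro: the function name mean_by_sample_size is hypothetical, and the data are simulated to stand in for variable X in group 1.

```python
import random
import statistics

def mean_by_sample_size(data, sizes, repeats=100, seed=0):
    """For each sample size, draw `repeats` random samples without
    replacement and summarise the resulting sample means."""
    rng = random.Random(seed)
    summary = {}
    for n in sizes:
        means = [statistics.fmean(rng.sample(data, n)) for _ in range(repeats)]
        # Average of the sample means, and their spread (an error estimate).
        summary[n] = (statistics.fmean(means), statistics.stdev(means))
    return summary

# Simulated data standing in for variable X in group 1 (2000 observations).
rng = random.Random(1)
data = [rng.gauss(141, 20) for _ in range(2000)]

results = mean_by_sample_size(data, [10, 20, 50, 100, 500])
for n, (avg, spread) in results.items():
    print(n, round(avg, 1), round(spread, 2))
```

The spread column is the reason for the inner loop of 100 repeats: the sample size at which it becomes acceptably small is one way to decide where the mean "levels off".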
@Andy W
@Oliver
Thanks for your suggestions, guys. I've managed to find a workaround using the macro below (from http://www.spsstools.net/Syntax/Bootstrap/GetRandomSampleOfVariousSizeCalcStats.txt). However, for this I need to copy and paste the variable data for a given group into a new data window. That's not too much of a problem. To take this further, would anyone know how:
1. I could get other statistics recorded, e.g. std error, std dev, etc.
2. I could use other analyses, ideally discriminant function analysis, and record the percentage of correct classifications in a new data window rather than having lots of output tables.
3. I could avoid copying and pasting variables for each group, so that I can just run the macro specifying n samples for variable x on groups 1, 2, 3 & 4.
Thanks again.
DEFINE !sample(myvar !TOKENS(1)
/nbsampl !TOKENS(1)
/size !CMDEND).
* myvar = the variable of interest (here we want the mean of salary)
* nbsampl = number of samples.
* size = the size of each samples.
!LET !first='1'
!DO !ss !IN (!size)
!DO !count = 1 !TO !nbsampl.
GET FILE='c:\Program Files\SPSS\employee data.sav'.
COMPUTE draw=uniform(1).
SORT CASES BY draw.
N OF CASES !ss.
COMPUTE samplenb=!count.
COMPUTE ss=!ss.
AGGREGATE
/OUTFILE=*
/BREAK=samplenb
/!myvar = MEAN(!myvar) /ss=FIRST(ss).
!IF (!first !NE '1') !THEN
ADD FILES /FILE=* /FILE='c:\temp\sample.sav'.
!IFEND
SAVE OUTFILE='c:\temp\sample.sav'.
!LET !first='0'
!DOEND.
!DOEND.
VARIABLE LABEL ss 'Sample size'.
EXAMINE
VARIABLES=!myvar BY ss /PLOT=BOXPLOT/STATISTICS=NONE/NOTOTAL
/MISSING=REPORT.
!ENDDEFINE.
* ----------------END OF MACRO ----------------------------------------------.
* Call macro (parameters are number of samples (here 20) and sizes of sample (here 5, 10, 15, 30, 50)).
* Thus 20 samples of size 5.
* Thus 20 samples of size 10, etc.
!sample myvar=salary nbsampl=20 size= 5 10 15 30 50.
