FFT Code Break Down - python-3.x

I am trying to implement the FFT in one of my projects. Unfortunately I feel like every site I go to is explaining things way over my head. I have looked at many different sites for clarification but alas it has so far eluded me.
Each of the sites that I have so far went to has either had the code written well with no comments on the variables or other explanation or has explained things at such a level that I cannot grasp it.
I would appreciate it if someone can break down each part of this code and the process in the most descriptive way possible.
First, I know that the input to the FFT is [1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0]. What do these numbers represent? Are they in hertz or volts?
Last, I know that the output from the FFT is 4.000 2.613 0.000 1.082 0.000 1.082 0.000 2.613. What do these numbers represent? What is the unit? How can they be used to get the magnitude or frequency from the data set?
Again, I am looking for every step to be explained, so commenting the following FFT code would also be very helpful. I would be eternally grateful if you can explain this well enough that a 5 year old would understand. (I feel about that age sometimes when looking at the articles).
Thanks for all the help in advance. You guys on here have helped me out a TON.
CODE SOURCE: http://rosettacode.org/wiki/Fast_Fourier_transform#Python
CODE:
from cmath import exp, pi
def fft(x):
# I understand that we are taking the length of the array sent
# into the function and assigning it to N. But I do not get why.
N = len(x)
# I get that we are returning itself if it is the only item.
# What does x represent at this point?
if N <= 1: return x
# We are creating an even variable and assigning it to the fft of
# the even terms of x. This is possibly because we can use this
# to take advantage of the symmetry?
even = fft(x[0::2])
# We are now doing the same thing with the odd variable. It is
# going to be the fft of the odd terms of x. Why would we need
# both if we are using it to take advantage of the symmetry?
odd = fft(x[1::2])
T= [exp(-2j*pi*k/N)*odd[k] for k in range(N//2)]
return [even[k] + T[k] for k in range(N//2)] + \
[even[k] - T[k] for k in range(N//2)]
# I understand that we are printing a join formatted as a float
# I get that the float will have 3 numbers after the decimal place and
# will take up a total of 5 spots
# I also understand that the abs(f) is what is being formatted and
# that the absolute value is getting rid of the imaginary portion
# that is often seen returned by the FFT
print( ' '.join("%5.3f" % abs(f)
for f in fft([1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0])) )
RETURNS:
4.000 2.613 0.000 1.082 0.000 1.082 0.000 2.613

An FFT is just a fast way to calculate of a DFT (using a factoring trick).
Perhaps learn what a DFT does first, as the FFT factoring trick might be confusing the issue of what a DFT does. A DFT is just a basis transform (a type of matrix multiply). The units can be completely arbitrary (milliVolts, inches, gallons, dollars, etc.) And any set of frequency results depends on a sample rate of the input data.

Related

Why my fit for a logarithm function looks so wrong

I'm plotting this dataset and making a logarithmic fit, but, for some reason, the fit seems to be strongly wrong, at some point I got a good enough fit, but then I re ploted and there were that bad fit. At the very beginning there were a 0.0 0.0076 but I changed that to 0.001 0.0076 to avoid the asymptote.
I'm using (not exactly this one for the image above but now I'm testing with this one and there is that bad fit as well) this for the fit
f(x) = a*log(k*x + b)
fit = fit f(x) 'R_B/R_B.txt' via a, k, b
And the output is this
Also, sometimes it says 7 iterations were as is the case shown in the screenshot above, others only 1, and when it did the "correct" fit, it did like 35 iterations or something and got a = 32 if I remember correctly
Edit: here is again the good one, the plot I got is this one. And again, I re ploted and get that weird fit. It's curious that if there is the 0.0 0.0076 when the good fit it's about to be shown, gnuplot says "Undefined value during function evaluation", but that message is not shown when I'm getting the bad one.
Do you know why do I keep getting this inconsistence? Thanks for your help
As I already mentioned in comments the method of fitting antiderivatives is much better than fitting derivatives because the numerical calculus of derivatives is strongly scattered when the data is slightly scatered.
The principle of the method of fitting an integral equation (obtained from the original equation to be fitted) is explained in https://fr.scribd.com/doc/14674814/Regressions-et-equations-integrales . The application to the case of y=a.ln(c.x+b) is shown below.
Numerical calculus :
In order to get even better result (according to some specified criteria of fitting) one can use the above values of the parameters as initial values for iterarive method of nonlinear regression implemented in some convenient software.
NOTE : The integral equation used in the present case is :
NOTE : On the above figure one can compare the result with the method of fitting an integral equation to the result with the method of fitting with derivatives.
Acknowledgements : Alex Sveshnikov did a very good work in applying the method of regression with derivatives. This allows an interesting and enlightening comparison. If the goal is only to compute approximative values of parameters to be used in nonlinear regression software both methods are quite equivalent. Nevertheless the method with integral equation appears preferable in case of scattered data.
UPDATE (After Alex Sveshnikov updated his answer)
The figure below was drawn in using the Alex Sveshnikov's result with further iterative method of fitting.
The two curves are almost indistinguishable. This shows that (in the present case) the method of fitting the integral equation is almost sufficient without further treatment.
Of course this not always so satisfying. This is due to the low scatter of the data.
In ADDITION , answer to a question raised in comments by CosmeticMichu :
The problem here is that the fit algorithm starts with "wrong" approximations for parameters a, k, and b, so during the minimalization it finds a local minimum, not the global one. You can improve the result if you provide the algorithm with starting values, which are close to the optimal ones. For example, let's start with the following parameters:
gnuplot> a=47.5087
gnuplot> k=0.226
gnuplot> b=1.0016
gnuplot> f(x)=a*log(k*x+b)
gnuplot> fit f(x) 'R_B.txt' via a,k,b
....
....
....
After 40 iterations the fit converged.
final sum of squares of residuals : 16.2185
rel. change during last iteration : -7.6943e-06
degrees of freedom (FIT_NDF) : 18
rms of residuals (FIT_STDFIT) = sqrt(WSSR/ndf) : 0.949225
variance of residuals (reduced chisquare) = WSSR/ndf : 0.901027
Final set of parameters Asymptotic Standard Error
======================= ==========================
a = 35.0415 +/- 2.302 (6.57%)
k = 0.372381 +/- 0.0461 (12.38%)
b = 1.07012 +/- 0.02016 (1.884%)
correlation matrix of the fit parameters:
a k b
a 1.000
k -0.994 1.000
b 0.467 -0.531 1.000
The resulting plot is
Now the question is how you can find "good" initial approximations for your parameters? Well, you start with
If you differentiate this equation you get
or
The left-hand side of this equation is some constant 'C', so the expression in the right-hand side should be equal to this constant as well:
In other words, the reciprocal of the derivative of your data should be approximated by a linear function. So, from your data x[i], y[i] you can construct the reciprocal derivatives x[i], (x[i+1]-x[i])/(y[i+1]-y[i]) and the linear fit of these data:
The fit gives the following values:
C*k = 0.0236179
C*b = 0.106268
Now, we need to find the values for a, and C. Let's say, that we want the resulting graph to pass close to the starting and the ending point of our dataset. That means, that we want
a*log(k*x1 + b) = y1
a*log(k*xn + b) = yn
Thus,
a*log((C*k*x1 + C*b)/C) = a*log(C*k*x1 + C*b) - a*log(C) = y1
a*log((C*k*xn + C*b)/C) = a*log(C*k*xn + C*b) - a*log(C) = yn
By subtracting the equations we get the value for a:
a = (yn-y1)/log((C*k*xn + C*b)/(C*k*x1 + C*b)) = 47.51
Then,
log(k*x1+b) = y1/a
k*x1+b = exp(y1/a)
C*k*x1+C*b = C*exp(y1/a)
From this we can calculate C:
C = (C*k*x1+C*b)/exp(y1/a)
and finally find the k and b:
k=0.226
b=1.0016
These are the values used above for finding the better fit.
UPDATE
You can automate the process described above with the following script:
# Name of the file with the data
data='R_B.txt'
# The coordinates of the last data point
xn=NaN
yn=NaN
# The temporary coordinates of a data point used to calculate a derivative
x0=NaN
y0=NaN
linearFit(x)=Ck*x+Cb
fit linearFit(x) data using (xn=$1,dx=$1-x0,x0=$1,$1):(yn=$2,dy=$2-y0,y0=$2,dx/dy) via Ck, Cb
# The coordinates of the first data point
x1=NaN
y1=NaN
plot data using (x1=$1):(y1=$2) every ::0::0
a=(yn-y1)/log((Ck*xn+Cb)/(Ck*x1+Cb))
C=(Ck*x1+Cb)/exp(y1/a)
k=Ck/C
b=Cb/C
f(x)=a*log(k*x+b)
fit f(x) data via a,k,b
plot data, f(x)
pause -1

Is there a logic error with my implementation of odeint?

I am getting the wrong solution and am unsure if odeint is the correct tool for solving this system of ODEs.
I am trying to model a simple first order chemical reaction by solving a system of ODEs. From a logic standpoint my functions are correct and I can solve this problem in MATLAB with little issue. I would like to also be able to do this work in python as well. I think odeint is the tool for the job but I could be wrong. My solution should not converge at independent variable = 10 every time but it always does regardless of inputs.
from matplotlib.pyplot import (plot,grid,xlabel,ylabel,show,legend)
import numpy as np
from scipy.integrate import odeint
wght= np.linspace(0,20)
# reaction is A -> B
def PBR(fun,W):
X,y = fun
P_0=20;#%bar
v_0=5; #%m^3/min
y_A0=1; #unitless
k=.005; #m^3/kg/min
alpha=0.1; #1/kg
epi=.13; #unitless
R=8.314; #J/mol/K
F_A0= .5 ;#mol/min
ra = -k *y*(1-X)/(1+epi*X)
dX = (-ra)/F_A0
dy = -alpha*(1+epi*X)/(2*y)
return [dX,dy]
X0 = 0.0
y0 = 1.0
sol = odeint(PBR, [X0, y0],wght)
plot(wght, sol[:, 0], 'b', label='X')
plot(wght, sol[:, 1], 'g', label='y')
legend(loc='best')
xlabel('W')
grid()
show()
print(sol)
Output graph
Your graphic is not reproducible, in python 3.7, scipy 1.4.1, I get a divergence to practically infinity at around 12.5, and after cleaning the work space, both graphs moving to zero after time 10.
The problem is your division by 2*y in the second equation. In effect that means that y is the square root of some other function, and that type of function behaves badly numerically at small values as the tangent becomes vertical, changing to zero shortly after.
However, one can desingularize the division in a simple way
dy = -alpha*(1+epi*X)*y/(1e-10+2*y*y)
or in a more complicated way that leaves the solution far away from zero really unchanged
dy = -alpha*(1+epi*X)*y/(y*y+max(1e-12,y*y))
resulting in the graph
If that is still not the expected behavior, then something else is different in your ODE system relative to the Matlab version.
That the singular point is at about w=10 depends almost completely on alpha=0.1. As epi is small and X is initially small, the second equation is close to
d(y^2)/dW = -alpha ==> y^2 = 1 - alpha*W
and this hits zero at about W=10 where the solution would have to end as the square of y can not take negative values.

Cosmic ray removal in spectra

Python developers
I am working on spectroscopy in a university. My experimental 1-D data sometimes shows "cosmic ray", 3-pixel ultra-high intensity, which is not what I want to analyze. So I want to remove this kind of weird peaks.
Does anybody know how to fix this issue in Python 3?
Thanks in advance!!
A simple solution could be to use the algorithm proposed by Whitaker and Hayes, in which they use modified z scores on the derivative of the spectrum. This medium post explains how it works and its implementation in python https://towardsdatascience.com/removing-spikes-from-raman-spectra-8a9fdda0ac22 .
The idea is to calculate the modified z scores of the spectra derivatives and apply a threshold to detect the cosmic spikes. Afterwards, a fixer is applied to remove the spike points and replace it by the mean values of the surrounding pixels.
# definition of a function to calculate the modified z score.
def modified_z_score(intensity):
median_int = np.median(intensity)
mad_int = np.median([np.abs(intensity - median_int)])
modified_z_scores = 0.6745 * (intensity - median_int) / mad_int
return modified_z_scores
# Once the spike detection works, the spectrum can be fixed by calculating the average of the previous and the next point to the spike. y is the intensity values of a spectrum, m is the window which we will use to calculate the mean.
def fixer(y,m):
threshold = 7 # binarization threshold.
spikes = abs(np.array(modified_z_score(np.diff(y)))) > threshold
y_out = y.copy() # So we don't overwrite y
for i in np.arange(len(spikes)):
if spikes[i] != 0: # If we have an spike in position i
w = np.arange(i-m,i+1+m) # we select 2 m + 1 points around our spike
w2 = w[spikes[w] == 0] # From such interval, we choose the ones which are not spikes
y_out[i] = np.mean(y[w2]) # and we average the value
return y_out
The answer depends a on what your data looks like: If you have access to two-dimensional CCD readouts that the one-dimensional spectra were created from, then you can use the lacosmic module to get rid of the cosmic rays there. If you have only one-dimensional spectra, but multiple spectra from the same source, then a quick ad-hoc fix is to make a rough normalisation of the spectra and remove those pixels that are several times brighter than the corresponding pixels in the other spectra. If you have only one one-dimensional spectrum from each source, then a less reliable option is to remove all pixels that are much brighter than their neighbours. (Depending on the shape of your cosmics, you may even want to remove the nearest 5 pixels or something, to catch the wings of the cosmic ray peak as well).

Confusing artifacts in PyWavelet complex-Morlet analysis of 1-kHz signal

I have some artifacts in a PyWavelets transform that are really confusing me. I'm using version 0.5.2. Can someone explain what is happening here?
I start by creating a 1-kHz signal, and then I attempt to analyze this signal with a complex Morlet continuous wavelet transform. I'm doing so with 3 octaves: 0.5 kHz to 1 kHz, 1 kHz to 2 kHz, and 2 kHz to 4 kHz, each one with 40 log-scaled scales. My intuition says that there should be a single peak at y=40 (being equivalent to 1 kHz), and that any difference in time should be minimal. Instead, I'm getting a peak at around y=35 to 37 (0.92 to 0.95 kHz), and there's some kind of periodic effect. (Strangely, this effect seems to occur only in the real component of the transform--the imaginary component looks closer to how I imagined it should look, though it's still not centered correctly. I believe that the real component and the imaginary component should look like time-shifted versions of each other, when looking at a pure sine wave.)
Am I misusing PyWavelets? Is there possibly a bug here? Any help would be welcome.
import numpy
import pywt
import matplotlib.pyplot as plt
# makes a 1-kHz signal
def make_data(length, quality):
tau = 2*numpy.pi
x = numpy.arange(length)
y = numpy.sin(tau * x/(quality/1000)) # the 1000 is for 1 kHz
return y
# does the continuous wavelet transform, outputting pic
def do_transform(data, base_freq, num_octaves, voices_per_octave, quality):
# calculate the scales, based on the desired frequencies
base_scale = quality / (2*base_freq)
far_scale = base_scale / 2**num_octaves
scales = numpy.geomspace(base_scale, far_scale, num=num_octaves*voices_per_octave+1, endpoint=True)
# actual calculation
coeffs, freqs = pywt.cwt(data, scales, "cmor", 1/quality)
print("freqs: " +str(freqs))
# output
truncated = coeffs[:, 100:200]
plt.imshow(abs(truncated), origin='lower', interpolation='none')
#plt.imshow(truncated.real, origin='lower', interpolation='none')
#plt.imshow(truncated.imag, origin='lower', interpolation='none')
plt.subplots_adjust(left=0.01, right=0.99, top=0.99, bottom=0.05)
plt.show()
data = make_data(1000, 44100)
do_transform(data, 500, 3, 40, 44100)
Magnitudes of transform
Real components of transform
Imaginary components of transform
It turns out that this is a known issue. (It's still not clear if this is a bug or just unexpected behavior, but the discussion is here: https://github.com/PyWavelets/pywt/issues/307 .)
Thanks to everyone who looked at it and considered it.

Spark : regression model threshold and precision

I have logistic regression mode, where I explicitly set the threshold to 0.5.
model.setThreshold(0.5)
I train the model and then I want to get basic stats -- precision, recall etc.
This is what I do when I evaluate the model:
val metrics = new BinaryClassificationMetrics(predictionAndLabels)
val precision = metrics.precisionByThreshold
precision.foreach { case (t, p) =>
println(s"Threshold is: $t, Precision is: $p")
}
I get results with only 0.0 and 1.0 as values of threshold and 0.5 is completely ignored.
Here is the output of the above loop:
Threshold is: 1.0, Precision is: 0.8571428571428571
Threshold is: 0.0, Precision is: 0.3005181347150259
When I call metrics.thresholds() it also returns only two values, 0.0 and 1.0.
How do I get the precision and recall values with threshold as 0.5?
You need to clear the model threshold before you make predictions. Clearing threshold makes your predictions return a score and not the classified label. If not you will only have two thresholds, i.e. your labels 0.0 and 1.0.
model.clearThreshold()
A tuple from predictionsAndLabels should look like (0.6753421,1.0) and not (1.0,1.0)
Take a look at https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/BinaryClassificationMetricsExample.scala
You probably still want to set numBins to control the number of points if the input is large.
I think what happens is that all the predictions are 0.0 or 1.0. Then the intermediate threshold values make no difference.
Consider the numBins argument of BinaryClassificationMetrics:
numBins:
if greater than 0, then the curves (ROC curve, PR curve) computed internally will be down-sampled to this many "bins". If 0, no down-sampling will occur. This is useful because the curve contains a point for each distinct score in the input, and this could be as large as the input itself -- millions of points or more, when thousands may be entirely sufficient to summarize the curve. After down-sampling, the curves will instead be made of approximately numBins points instead. Points are made from bins of equal numbers of consecutive points. The size of each bin is floor(scoreAndLabels.count() / numBins), which means the resulting number of bins may not exactly equal numBins. The last bin in each partition may be smaller as a result, meaning there may be an extra sample at partition boundaries.
So if you don't set numBins, then precision will be calculated at all the different prediction values. In your case this seems to be just 0.0 and 1.0.
First, try adding more bins like this (here numBins is 10):
val metrics = new BinaryClassificationMetrics(probabilitiesAndLabels,10);
If you still only have two thresholds of 0 and 1, then check to make sure the way you have defined your predictionAndLabels. You many be having this problem if you have accidentally provided (label, prediction) instead of (prediction, label).

Resources