Why is stft(istft(x)) ≠ x? - audio

Why is stft(istft(x)) ≠ x?
Using PyTorch, I have computed the short-time Fourier transform of the inverse short-time Fourier transform of a tensor.
I have done so as shown below given a tensor x.
For x, the real and imaginary part are equal, or the imaginary part is set to zero -- both produces the same problem.
torch.stft(torchaudio.functional.istft(x, n_fft), n_fft)
As shown in the image, only a single one of the stripes in the tensor remains after applying stft(istft(x)) -- all other stripes vanish.
If stft(istft(x)) (bottom) was equal to x (top), both images would look similar.
Why are they so different?
It seems like stft(istft(x)) can pick up only certain frequencies of x.
x (top) and stft of istft of x (bottom)
I have also tried the same with scipy.signal.istft and scipy.signal.stft which causes the same issue.
Moreover, I have tried it with a wide range of tensors x, e.g., different randomized distributions, images, and other stripes.
Also, I have tried a variety of hyper-parameters for stft/istft.
Only for x generated by a short-time Fourier transform from a sound wave, it works.

A short-time Fourier transform produces a more data than there is in the original signal. Where a signal has N real samples, then STFT might have 4N complex samples -- 8 times more data.
It follows that the ISTFT operation must discard 7/8 of the data you provide it.
Most of the data in a STFT is redundant, and if you just make up values for all of the data, it is unlikely to correspond to a real signal.
In that case, an implementation of ISTFT will probably use a least-squares fit or other method of producing a signal with an STFT that matches your data as closely as possible, but it won't always be close.

Related

Converting intensities to probabilities in ppp

Apologies for the overlap with existing questions; mine is at a more basic skill level. I am working with very sparse occurrences spanning very large areas, so I would like to calculate probability at pixels using the density.ppp function (as opposed to relrisk.ppp, where specifying presences+absences would be computationally intractable). Is there a straightforward way to convert density (intensity) to probabilities at each point?
Maxdist=50
dtruncauchy=function(x,L=60) L/(diff(atan(c(-1,1)*Maxdist/L)) * (L^2 + x^2))
dispersfun=function(x,y) dtruncauchy(sqrt(x^2+y^2))
n=1e3; PPP=ppp(1:n,1:n, c(1,n),c(1,n), marks=rep(1,n));
density.ppp(PPP,cutoff=Maxdist,kernel=dispersfun,at="points",leaveoneout=FALSE) #convert to probabilies?
Thank you!!
I think there is a misunderstanding about fundamentals. The spatstat package is designed mainly for analysing "mapped point patterns", datasets which record the locations where events occurred or things were located. It is designed for "presence-only" data, not "presence/absence" data (with some exceptions).
The relrisk function expects input data about the presence of two different types of events, such as the mapped locations of trees belonging to two different species, and then estimates the spatially-varying probability that a tree will belong to each species.
If you have 'presence-only' data stored in a point pattern object X of class "ppp", then density(X, ....) will produce a pixel image of the spatially-varying intensity (expected number of points per unit area). For example if the spatial coordinates were expressed in metres, then the intensity values are "points per square metre". If you want to calculate the probability of presence in each pixel (i.e. for each pixel, the probability that there is at least one presence point in the pixel), you just need to multiply the intensity value by the area of one pixel, which gives the expected number of points in the pixel. If pixels are small (the usual case) then the presence probability is just equal to this value. For physically larger pixels the probability is 1 - exp(-m) where m is the expected number of points.
Example:
X <- redwood
D <- density(X, 0.2)
pixarea <- with(D, xstep * ystep)
M <- pixarea * D
p <- 1 - exp(-M)
then M and p are images which should be almost equal, and can both be interpreted as probability of presence.
For more information see Chapter 6 of the spatstat book.
If, instead, you had a pixel image of presence/absence data, with pixel values equal to 1 or 0 for presence or absence respectively, then you can just use the function blur in the spatstat package to perform kernel smoothing of the image, and the resulting pixel values are presence probabilities.

I get strange sizzle/artifacts in my audio when doing differnt FFT approaches

I am doing filter convolution by using fft (FFTW). I experience something I can not understand.
I have an input x(n) which I want to apply a filter IR u(n). Both length N. So I zero pad both e.g. to 2n and do FFT of both to get X(n) and U(n). if I just do X(n)*U(n) and IFFT I get a signal y(t). If I listen to the signal there is no sizzling, all sounds ok. For speeding up the programm and saving memory I tried to take advantage of symmetrie of U(n) and X(n)and to use only first half of U(n) and X(n) and zero padding the second half. So I did X(n0...n/2,0,0,0,0,..,N)U(n0,..,n/2,0,0,0,..,N) and IFFT.
The resulting output sounds not different to the result when multipling full length XU but there is strange subtile sizzling noise audible laying on the output. Mostley apparent on loud/resonant input signal parts, sounds almost like clipping the stage. I did not change anything in the scaling in both methods so, I don´t understand whats going on. Could someone help me out with an idea?
Is it wrong to just use half of U and X and zero pad the rest , must I use the full length? Or does this change e.g. scaling?
You can not simply set part of your signal spectra to zero. Any real signal (with no imaginary component) has a conjugate complex spectrum. I guess this is the symmetry you are talking about. If you set part of the spectrum to zero your signal in the time domain will be complex and completely different from the original signal you started with.
If you want to speed up your computation reduce the number of your samples you are working with

Normalizing audio waveforms code implementation (Peak, RMS)

I have some audio data (array of floats) which I use to plot a simple
waveform.
When plotted, the waveform doesn't max out at the edges.
No problem - the data just needs to be normalized. I iterate once to find the max, and then iterate again dividing each by the max. Plot again and everything looks great!
But wait videos which have a loud intro, or loud explosion, causes the rest of the waveform to still be tiny.
After some research, I come across RMS that is supposed to address this. I iterate through the samples and calculate the RMS, and again divide each sample by the RMS value. This results in considerable "clipping":
What is the best method to solve this?
Intuitively, it seems I might need to calculate a local max or average based on a moving window (rather than the entire set) but I'm not entirely sure. Help?
Note: The waveform is purely for visual purposes (the audio will not be played back to the user).
You could transpose it (effectively making the y-axis non-linear, or you can think it as a form of companding).
Assuming the signal is within the range [-1, 1].
One popular quick and simple solution is to simply apply the hyperbolic tangens function (tanh). This will limit values to [-1, 1] by penalizing higher values more. If you amplify the signal before applying tanh, the effect will be more pronounced.
Another alternative is a logarithmic transform. As the signal changes sign some pre-processing has to be performed.
If r is a series of sample values one approach could be something like this:
r.log1p <- log2(1.1 * (abs(r) + 1)) * sign(r)
That is, for every value take its absolute, add one, multiply with some small constant, take the log and then finally multiply it with the sign of its corresponding old value.
The effect can be something like this:

"Winamp style" spectrum analyzer

I have a program that plots the spectrum analysis (Amp/Freq) of a signal, which is preety much the DFT converted to polar. However, this is not exactly the sort of graph that, say, winamp (right at the top-left corner), or effectively any other audio software plots. I am not really sure what is this sort of graph called (if it has a distinct name at all), so I am not sure what to look for.
I am preety positive about the frequency axis being base two exponential, the amplitude axis puzzles me though.
Any pointers?
Actually an interesting question. I know what you are saying; the frequency axis is certainly logarithmic. But what about the amplitude? In response to another poster, the amplitude can't simply be in units of dB alone, because dB has no concept of zero. This introduces the idea of quantization error, SNR, and dynamic range.
Assume that the received digitized (i.e., discrete time and discrete amplitude) time-domain signal, x[n], is equal to s[n] + e[n], where s[n] is the transmitted discrete-time signal (i.e., continuous amplitude) and e[n] is the quantization error. Suppose x[n] is represented with b bits, and for simplicity, takes values in [0,1). Then the maximum peak-to-peak amplitude of e[n] is one quantization level, i.e., 2^{-b}.
The dynamic range is the defined to be, in decibels, 20 log10 (max peak-to-peak |s[n]|)/(max peak-to-peak |e[n]|) = 20 log10 1/(2^{-b}) = 20b log10 2 = 6.02b dB. For 16-bit audio, the dynamic range is 96 dB. For 8-bit audio, the dynamic range is 48 dB.
So how might Winamp plot amplitude? My guesses:
The minimum amplitude is assumed to be -6.02b dB, and the maximum amplitude is 0 dB. Visually, Winamp draws the window with these thresholds in mind.
Another nonlinear map, such as log(1+X), is used. This function is always nonnegative, and when X is large, it approximates log(X).
Any other experts out there who know? Let me know what you think. I'm interested, too, exactly how this is implemented.
To generate a power spectrum you need to do the following steps:
apply window function to time domain data (e.g. Hanning window)
compute FFT
calculate log of FFT bin magnitudes for N/2 points of FFT (typically 10 * log10(re * re + im * im))
This gives log magnitude (i.e. dB) versus linear frequency.
If you also want a log frequency scale then you will need to accumulate the magnitude from appropriate ranges of bins (and you will need a fairly large FFT to start with).
Well I'm not 100% sure what you mean but surely its just bucketing the data from an FFT?
If you want to get the data such that you have (for a 44Khz file) frequency points at 22Khz, 11Khz 5.5Khz etc then you could use a wavelet decomposition, i guess ...
This thread may help ya a bit ...
Converting an FFT to a spectogram
Same sort of information as a spectrogram I'd guess ...
What you need is power spectrum graph. You have to compute DFT of your signal's current window. Then square each value.

Identifying common periodic waveforms (square, sine, sawtooth, ...)

Without any user interaction, how would a program identify what type of waveform is present in a recording from an ADC?
For the sake of this question: triangle, square, sine, half-sine, or sawtooth waves of constant frequency. Level and frequency are arbitrary, and they will have noise, small amounts of distortion, and other imperfections.
I'll propose a few (naive) ideas, too, and you can vote them up or down.
You definitely want to start by taking an autocorrelation to find the fundamental.
With that, take one period (approximately) of the waveform.
Now take a DFT of that signal, and immediately compensate for the phase shift of the first bin (the first bin being the fundamental, your task will be simpler if all phases are relative).
Now normalise all the bins so that the fundamental has unity gain.
Now compare and contrast the rest of the bins (representing the harmonics) against a set of pre-stored waveshapes that you're interested in testing for. Accept the closest, and reject overall if it fails to meet some threshold for accuracy determined by measurements of the noisefloor.
Do an FFT, find the odd and even harmonic peaks, and compare the rate at which they decrease to a library of common waveform.. peak... ratios.
Perform an autocorrelation to find the fundamental frequency, measure the RMS level, find the first zero-crossing, and then try subtracting common waveforms at that frequency, phase, and level. Whichever cancels out the best (and more than some threshold) wins.
This answer presumes no noise and that this is a simple academic exercise.
In the time domain, take the sample by sample difference of the waveform. Histogram the results. If the distribution has a sharply defined peak (mode) at zero, it is a square wave. If the distribution has a sharply defined peak at a positive value, it is a sawtooth. If the distribution has two sharply defined peaks, one negative and one positive,it is a triangle. If the distribution is broad and is peaked at either side, it is a sine wave.
arm yourself with more information...
I am assuming that you already know that a theoretically perfect sine wave has no harmonic partials (ie only a fundamental)... but since you are going through an ADC you can throw the idea of a theoretically perfect sine wave out the window... you have to fight against aliasing and determining what are "real" partials and what are artifacts... good luck.
the following information comes from this link about csound.
(*) A sawtooth wave contains (theoretically) an infinite number of harmonic partials, each in the ratio of the reciprocal of the partial number. Thus, the fundamental (1) has an amplitude of 1, the second partial 1/2, the third 1/3, and the nth 1/n.
(**) A square wave contains (theoretically) an infinite number of harmonic partials, but only odd-numbered harmonics (1,3,5,7,...) The amplitudes are in the ratio of the reciprocal of the partial number, just as sawtooth waves. Thus, the fundamental (1) has an amplitude of 1, the third partial 1/3, the fifth 1/5, and the nth 1/n.
I think that all of these answers so far are quite bad (including my own previous...)
after having thought the problem through a bit more I would suggest the following:
1) take a 1 second sample of the input signal (doesn't need to be so big, but it simplifies a few things)
2) over the entire second, count the zero-crossings. at this point you have the cps (cycles per second) and know the frequency of the oscillator. (in case that's something you wanted to know)
3) now take a smaller segment of the sample to work with: take precisely 7 zero-crossings worth. (so your work buffer should now, if visualized, look like one of the graphical representations you posted with the original question.) use this small work buffer to perform the following tests. (normalizing the work buffer at this point could make life easier)
4) test for square-wave: zero crossings for a square wave are always very large differences, look for a large signal delta followed by little to no movement until the next zero crossing.
5) test for saw-wave: similar to square-wave, but a large signal delta will be followed by a linear constant signal delta.
6) test for triangle-wave: linear constant (small) signal deltas. find the peaks, divide by the distance between them and calculate what the triangle wave should look like (ideally) now test the actual signal for deviance. set a deviance tolerance threshold and you can determine whether you are looking at a triangle or a sine (or something parabolic).
First find the base frequency and the phase. You can do that with FFT. Normalize the sample. Then subtract each sample with the sample of the waveform you want to test (same frequency and same phase). Square the result add it all up and divide it by the number of samples. The smallest number is the waveform you seek.

Resources