How do I know which spectrogram frames belong to which audio samples?

I’ve been using this script:
spgram = torchaudio.transforms.Spectrogram(512, hop_length=32)
audio = spgram(audio)
to get the spectrogram of some stereo music audio. I expected the resulting spectrogram to have the shape [2, 257, audio.shape[1]/32]. However, that's not the case. For example, an audio clip with size [2, 199488] (with sr=24576) yields a spectrogram with size [2, 257, 6241] (note that 199488/32=6234). Why is that, and how can I convert from frame location to sample location?

See center parameter.
whether to pad waveform on both sides so that the t-th frame is centered at time t x hop_length. (Default: True)
So, by default, the signal is padded on both sides before the STFT. With center=True, torch.stft (which Spectrogram uses under the hood) pads n_fft // 2 samples at each end, which is where the extra frames come from. The upside of centering is that the mapping back is simple: frame t is centered at sample t * hop_length of the original signal, so converting a frame index to a sample position is just a multiplication by hop_length.
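As a quick sanity check, here is a minimal sketch of that bookkeeping (the waveform is just a random stand-in with your shape, and the frame index is arbitrary):
import torch
import torchaudio
n_fft, hop_length = 512, 32
spgram = torchaudio.transforms.Spectrogram(n_fft, hop_length=hop_length)
waveform = torch.randn(2, 199488)   # stand-in for the stereo clip
spec = spgram(waveform)
print(spec.shape)                   # [2, n_fft // 2 + 1, n_frames]
frame = 100                         # arbitrary frame index
center_sample = frame * hop_length  # sample the frame is centered on
start = center_sample - n_fft // 2  # first sample covered by the frame
end = center_sample + n_fft // 2    # last sample (may run past either end)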

Related

Detecting duplicate audio files

I have snippets of audio that are almost the same that I want to group together (samples 5 and 3 below). There are other portions that are similar, but differ (3 and 4, there is a double drum hit at the end for 3) and completely different ones (sample 8).
How can I group together samples that are almost the same? I tried taking the difference (attempting to minimize it), but that does not work since they are not aligned. I also tried to take audio features like pitch distribution, but since the sounds are similar in pitch those don't get separated well.
The files are available here: https://drive.google.com/drive/folders/14UQQDfIBUNRO_1Pv8bkPf9noi86M7lKd
Here's something that appears to work for the data you are using, but it may well have weaknesses on other data or other kinds of audio. Maybe it will be helpful nonetheless.
The basic idea of this solution is to compute the MFCCs of each of the samples to get feature vectors, and then find a distance between those feature sets (here just basic Euclidean distance), on the assumption (which seems to hold for your data) that the least similar samples will have the largest distance and the most similar the smallest. Here's the code:
import librosa
import scipy.spatial.distance
import matplotlib.pyplot as plt
sample3, rate = librosa.load('sample3.wav', sr=None)
sample4, rate = librosa.load('sample4.wav', sr=None)
sample5, rate = librosa.load('sample5.wav', sr=None)
sample8, rate = librosa.load('sample8.wav', sr=None)
# cut the longer sounds to same length as the shortest
len5 = len(sample5)
sample3 = sample3[:len5]
sample4 = sample4[:len5]
sample8 = sample8[:len5]
mf3 = librosa.feature.mfcc(y=sample3, sr=rate)
mf4 = librosa.feature.mfcc(y=sample4, sr=rate)
mf5 = librosa.feature.mfcc(y=sample5, sr=rate)
mf8 = librosa.feature.mfcc(y=sample8, sr=rate)
# collapse the coefficient axis so each clip becomes one vector (one value per frame). dubious?
amf3 = mf3.mean(axis=0)
amf4 = mf4.mean(axis=0)
amf5 = mf5.mean(axis=0)
amf8 = mf8.mean(axis=0)
f_list = [amf3, amf4, amf5, amf8]
results = []
for i, features_a in enumerate(f_list):
    results.append([])
    for features_b in f_list:
        result = scipy.spatial.distance.euclidean(features_a, features_b)
        results[i].append(result)
plt.ion()
fig, ax = plt.subplots()
ax.imshow(results, cmap='gray_r', interpolation='nearest')
spots = [0, 1, 2, 3]
labels = ['s3', 's4', 's5', 's8']
ax.set_xticks(spots)
ax.set_xticklabels(labels)
ax.set_yticks(spots)
ax.set_yticklabels(labels)
The code plots a heatmap of the distances across all the samples. The code is lazy in that it recomputes the elements that are symmetric across the diagonal (which are identical) as well as the diagonal itself (which should be zero distance), but those serve as sanity checks: it is nice to see white down the diagonal and a symmetric matrix.
The real information is that clip 8 is black against all the other clips (i.e. furthest from them) and clip 3 and clip 5 are the least distant from one another.
This basic idea could be done with a feature vector generated in a different sort of way (e.g. instead of MFCCs, you could use the embeddings from something like YAMNet) or with a different way of finding a distance between the feature vectors.
For the grouping part of what you want to do, you could experimentally work out a threshold on the distance metric below which you would consider a clip to be in the same group as another. With more clips, you could compute all these distances and then hand that distance matrix over to a clustering algorithm (like HDBSCAN) to cluster the clips.
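For instance, here is a minimal sketch of that last idea using scipy's hierarchical clustering on the distance matrix computed above (the threshold value is a made-up placeholder you would tune on your data):
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, fcluster
dist_matrix = np.array(results)                    # the 4x4 matrix from above
condensed = squareform(dist_matrix, checks=False)  # flatten to the condensed form scipy expects
Z = linkage(condensed, method='average')
labels = fcluster(Z, t=200.0, criterion='distance')  # clips sharing a label form a group
print(labels)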

What is the upsampling method called 'area' used for?

The PyTorch function torch.nn.functional.interpolate contains several modes for upsampling, such as: nearest, linear, bilinear, bicubic, trilinear, area.
What is the area upsampling mode used for?
As jodag said, it is resizing using adaptive average pooling. While the answer at the link aims to explain what adaptive average pooling is, I find the explanation a bit vague.
TL;DR: the area mode of torch.nn.functional.interpolate is probably one of the most intuitive choices when you want to downsample an image.
You can think of it as applying an averaging low-pass filter (LPF) to the original image and then sampling. Applying an LPF before sampling prevents potential aliasing in the downsampled image. Aliasing can result in Moiré patterns in the downscaled image.
It is probably called "area" because it (roughly) preserves the area ratio between the input and output shapes when averaging the input pixels. More specifically, every pixel in the output image is the average of a corresponding region in the input image, and the area of that region is roughly the ratio between the input image's area and the output image's area.
Furthermore, interpolate with mode = 'area' calls the source function adaptive_avg_pool2d (implemented in C++), which assigns to each pixel in the output tensor the average of all pixel intensities within a computed region of the input. That region is computed per output pixel and can vary in size from pixel to pixel. It is found by scaling the output pixel's row and column indices by the ratio of input size to output size (separately for height and width): the floor of index * (input_size / output_size) gives the region's starting index, and the ceil of (index + 1) * (input_size / output_size) gives its ending index.
Here's an in-depth analysis of what happens in nn.AdaptiveAvgPool2d:
First of all, as stated there you can find the source code for adaptive average pooling (in C++) here: source
Taking a look at the function where the magic happens (or at least the magic on the CPU for a single frame), static void adaptive_avg_pool2d_single_out_frame, we have 5 nested loops: the outer three run over the channel dimension, then width, then height, and within the body of the 3rd loop the magic happens:
First compute the region within the input image which is used to calculate the value of the current pixel (recall we had width and height loop to run over all pixels in the output).
How is this done?
Using a simple computation of start and end indices: the start index is floor((input_height / output_height) * current_output_pixel_height) and the end index is ceil((input_height / output_height) * (current_output_pixel_height + 1)), and similarly for the width.
Then, all that is done is to simply average the intensities of all pixels in that region and current channel and place the result in the current output pixel.
I wrote a simple Python snippet that does the same thing, in the same fashion (loops, naive) and produces equivalent results. It takes tensor a and uses adaptive average pool to resize a to shape output_shape in 2 ways - once using the built-in nn.AdaptiveAvgPool2d and once with my translation into Python of the source function in C++: static void adaptive_avg_pool2d_single_out_frame. Built-in function's result is saved into b and my translation is saved into b_hat. You can see that the results are equivalent (you can further play with the spatial shapes and validate this):
import torch
from math import floor, ceil
from torch import nn
a = torch.randn(1, 3, 15, 17)
out_shape = (10, 11)
b = nn.AdaptiveAvgPool2d(out_shape)(a)
b_hat = torch.zeros(b.shape)
for d in range(a.shape[1]):
    for w in range(b_hat.shape[3]):
        for h in range(b_hat.shape[2]):
            startW = floor(w * a.shape[3] / out_shape[1])
            endW = ceil((w + 1) * a.shape[3] / out_shape[1])
            startH = floor(h * a.shape[2] / out_shape[0])
            endH = ceil((h + 1) * a.shape[2] / out_shape[0])
            b_hat[0, d, h, w] = torch.mean(a[0, d, startH: endH, startW: endW])
'''
Prints Mean Squared Error = 0 (or a very small number, due to precision error)
as both outputs are the same, proof of output equivalence:
'''
print(nn.MSELoss()(b_hat, b))
Looking at the source code it appears area interpolation is equivalent to resizing a tensor via adaptive average pooling. You can refer to this question for an explanation of adaptive average pooling. Therefore area interpolation is more applicable to downsampling than upsampling.
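A quick way to convince yourself of that equivalence (a small sketch; the input shape is arbitrary):
import torch
import torch.nn.functional as F
from torch import nn
a = torch.randn(1, 3, 64, 48)
out_shape = (10, 11)
via_interpolate = F.interpolate(a, size=out_shape, mode='area')
via_pool = nn.AdaptiveAvgPool2d(out_shape)(a)
print(torch.allclose(via_interpolate, via_pool))   # expected: True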

Getting multiple sinusoidal waves using Fourier Transform Python

According to http://www.thefouriertransform.com/
" The Fourier Transform shows that any waveform can be re-written as the sum of sinusoidal functions. "
I have some signals (each with a shape of (256, 64)) that I want to break down into sub-signals, and I then want to use those sub-signals to generate the real signal back. I am doing it right now like this:
import pickle
import numpy as np
import matplotlib.pyplot as plt
from scipy.fftpack import fft
# getting data
with open('../f', 'rb') as fp:
    f = pickle.load(fp)
f = f[0]
tf = fft(f)
x = np.reshape(np.abs(tf), (256, 64))
plt.plot(x)
plt.show()
print(x.shape)  # same shape as f
But I am getting output in the same shape as the real signal, with some imaginary values that are ultimately discarded. I have looked at other Fourier questions here, but none of them gave a satisfying result; they just transformed the input signal. What am I doing wrong? Any help will be much appreciated.
To see the sinusoidal components, you need to plot sine waves.
x = a * sin(t)
not a reshaped FFT result.
If you don't care about phase, the number of sinewave plots will be half the length of your FFT plus 1, with each sinewave at the frequency of its FFT bin (index times sample rate divided by FFT length) and its amplitude given by the abs() of that FFT bin.
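For completeness, here is a small sketch of that idea, with phase kept so the sinusoids sum back exactly to the input; the test signal and sample rate are made up for illustration:
import numpy as np
fs = 256.0                              # made-up sample rate
n = 256
t = np.arange(n) / fs
sig = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 12 * t)
spec = np.fft.rfft(sig)                 # one-sided FFT: n // 2 + 1 bins
freqs = np.fft.rfftfreq(n, d=1 / fs)    # frequency of each bin
recon = np.zeros(n)                     # rebuild the signal as an explicit sum of sinusoids
for k, (freq, c) in enumerate(zip(freqs, spec)):
    amp = np.abs(c) / n
    if 0 < k < n // 2:                  # interior bins appear twice in the full spectrum
        amp *= 2
    recon += amp * np.cos(2 * np.pi * freq * t + np.angle(c))
print(np.allclose(recon, sig))          # True: the sinusoids add up to the original signal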

How to program sound card to output a specific signal?

I'm trying to program my sound card to output specific values.
Let's say, I have below sequence
[1,2,3,2,10,4,1,50,20,1]
I want the sound card to output the specified analog signal according to this sequence.
I can use the Windows Multimedia API, of course. However, my task is lightweight and I don't want to use such a heavy framework.
Any suggestions on this?
I propose you generate a .wav file and play it with a media player.
It's easy with Python and its wave module. The example below was written with Python 3.3.
import wave
import struct
# define your sequence
frames = [1, 2, 3, 2, 10, 4, 1, 50, 20, 1]
output = wave.open('out.wav', mode='wb')  # create the file that will contain your sequence
output.setnchannels(1)          # 1 for mono
output.setsampwidth(1)          # resolution in number of bytes (1 byte = unsigned 8-bit samples)
output.setframerate(10)         # sampling rate; for real audio this is usually 44100 Hz
output.setnframes(len(frames))  # sequence length
for frame in frames:
    output.writeframes(struct.pack('B', frame))  # pack each value as one unsigned byte
# close the file, you're done
output.close()
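If you also want to trigger playback from the same script instead of opening a media player by hand, one lightweight option on Windows is the standard-library winsound module (Windows-only; just a usage sketch):
import winsound
winsound.PlaySound('out.wav', winsound.SND_FILENAME)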
You can do this in one line if you use Matlab or the free equivalent, Octave. The relevant documentation is here.
soundsc(x, fs, [ lo, hi ])
Scale the signal so that [lo, hi] -> [-1, 1], then play it
at sampling rate fs. If fs is empty, then the default 8000 Hz
sampling rate is used.
Your function call in the console would look like this (with fs replaced by your sampling rate, or [] to use the 8000 Hz default) ...
soundsc([1,2,3,2,10,4,1,50,20,1], fs, [1 50]);
... or like this with manual normalisation of the positive integer vector to give values between +/- 1 ...
x = [1,2,3,2,10,4,1,50,20,1];
x=x-min(x); % get values to range from zero up
x=x/max(x); % get floating point values to range between 0.0 and 1.0
x=x*2-1; % get values to range between +/- 1.0;
soundsc(x);

Libsox encoding

Why do I get distorted output if I convert a wav file using libsox to:
in->encoding.encoding = SOX_ENCODING_UNSIGNED;
in->encoding.bits_per_sample = 8;
using the above code?
The input file has bits_per_sample = 16.
So you're saying that you tell SOX to read a 16 bit sample WAV file as an 8 bit sample file? Knowing nothing about SOX, I would expect it to read each 16 bit sample as two 8 bit samples... the high order byte and the low order byte like this: ...HLHLHLHLHL...
For simplicity, we'll call high order byte samples 'A' samples. 'A' samples carry the original sound with less dynamic range, because the low order byte with the extra precision has been chopped off.
We'll call the low order byte samples 'B' samples. These will be roughly random and amount to noise.
So, as a result we'll have the original sound, carried by the 'A' samples, shifted down in frequency by half (an octave lower), because there's a 'B' sample between every 'A' sample, which halves the effective rate of the 'A' samples. The 'B' samples add noise on top. So we'll have the original sound, shifted down by half, with noise.
Is that what you're hearing?
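To see that interleaving concretely, here is a small numpy sketch (note that WAV data is little-endian, so the low byte actually comes first in each pair):
import numpy as np
t = np.arange(1000)
sig16 = (3000 * np.sin(2 * np.pi * t / 50)).astype('<i2')  # a 16-bit sine wave
as8 = np.frombuffer(sig16.tobytes(), dtype=np.int8)        # reinterpret its raw bytes as 8-bit samples
print(len(sig16), len(as8))   # twice as many "samples": 1000 vs 2000
high_bytes = as8[1::2]        # the recognizable (coarser) waveform
low_bytes = as8[0::2]         # vary rapidly and don't resemble the waveform (noise-like for real audio)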
Edit: Guest commented that the goal is to downconvert a WAV to 8 bit audio. Reading the manpage for SoX, it looks like SoX always uses 32 bit audio in memory as a result of sox_read(). Passing it a format will only make it attempt to read from that format.
To downconvert in memory, use SOX_SAMPLE_TO_SIGNED_8BIT or SOX_SAMPLE_TO_UNSIGNED_8BIT from sox.h, e.g.:
sox_format_t *ft = sox_open_read("/file/blah.wav", NULL, NULL);
if (ft) {
    sox_ssample_t buffer[100];
    sox_size_t amt = sox_read(ft, buffer, sizeof(buffer) / sizeof(buffer[0]));  /* length is in samples */
    char sample8bit = SOX_SAMPLE_TO_SIGNED_8BIT(buffer[0], ft->clips);
}
To output a downconverted file, use the 8 bit format when writing instead of when reading.
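If the end goal is simply an 8-bit WAV on disk (rather than doing the conversion through the libsox API), a quick way to sanity-check the expected result is Python's standard library; this is just a sketch with placeholder file names, using the audioop module (deprecated in recent Python versions):
import wave
import audioop
with wave.open('input_16bit.wav', 'rb') as src:
    params = src.getparams()
    data16 = src.readframes(src.getnframes())
data8_signed = audioop.lin2lin(data16, 2, 1)   # 16-bit -> 8-bit samples
data8 = audioop.bias(data8_signed, 1, 128)     # 8-bit WAV samples are stored unsigned
with wave.open('output_8bit.wav', 'wb') as dst:
    dst.setnchannels(params.nchannels)
    dst.setsampwidth(1)
    dst.setframerate(params.framerate)
    dst.writeframes(data8)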
