I am trying to develop a method to classify audio using MFCCs in Weka. The MFCCs I have are generated with a buffer size of 1024, so there is a series of MFCC coefficients for each audio recording. I want to convert these coefficients into the ARFF data format for Weka, but I'm not sure how to approach this problem.
I also asked a separate question about merging the data, because I feel that may affect the conversion to ARFF format.
I know that for an ARFF file the data needs to be listed through attributes. Should each coefficient of the MFCC be a separate attribute, or should the array of coefficients be a single attribute? Should each data instance represent a single MFCC frame, a window of time, or the entire file or sound? Below, I wrote out what I think it should look like if it only took one MFCC frame into account, which I don't think would be able to classify an entire sound.
@relation audio
@attribute mfcc1 real
@attribute mfcc2 real
@attribute mfcc3 real
@attribute mfcc4 real
@attribute mfcc5 real
@attribute mfcc6 real
@attribute mfcc7 real
@attribute mfcc8 real
@attribute mfcc9 real
@attribute mfcc10 real
@attribute mfcc11 real
@attribute mfcc12 real
@attribute mfcc13 real
@attribute class {bark, honk, talking, wind}
@data
126.347275, -9.709645, 4.2038302, -11.606304, -2.4174862, -3.703139, 12.748064, -5.297932, -1.3114156, 2.1852574, -2.1628475, -3.622149, 5.851326, bark
Any help will be greatly appreciated.
Edit:
I have generated some ARFF files for Weka using openSMILE, following a method from this website, but I am not sure how this data would be used to classify the audio, because each row of data is 10 milliseconds of audio from the same file. The name attribute of each row is "unknown", which I assume is the attribute that the data would try to classify. How would I be able to classify an overall sound (rather than 10 milliseconds of it) and compare it to several other overall sounds?
Edit #2: Success!
After more thoroughly reading the website that I found, I saw the Accumulate script and the Test and Train data files. The Accumulate script puts the sets of MFCC data generated from separate audio files together into one ARFF file. Their file was composed of about 200 attributes holding statistics for 12 MFCCs. Although I wasn't able to retrieve these statistics using openSMILE, I used Python libraries to do so. The statistics were max, min, kurtosis, range, standard deviation, and so on. I accurately classified my audio files using BayesNet and Multilayer Perceptron in Weka, which both yielded 100% accuracy for me.
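For anyone trying to reproduce this kind of feature set, here is a minimal sketch of computing per-coefficient statistics in Python (assuming librosa for MFCC extraction and SciPy for the statistics; this is not the exact script from that website):

import numpy as np
import librosa
from scipy.stats import kurtosis

def mfcc_stats(path, n_mfcc=13):
    # Load the file and compute one MFCC vector per frame (shape: n_mfcc x n_frames).
    y, sr = librosa.load(path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    feats = []
    for row in mfcc:  # one row per coefficient; statistics are taken over time
        feats += [row.min(), row.max(), row.mean(), row.std(),
                  row.max() - row.min(), kurtosis(row)]
    return feats  # one fixed-length instance per audio file, ready for an ARFF row

Each audio file then contributes exactly one line under @data, no matter how long the recording is.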
I don't know much about MFCCs, but if you are trying to classify audio files, then each line under @data must represent one audio file. If you used windows of time or a single MFCC frame for each line under @data, then the Weka classifiers would be trying to classify windows of time or individual MFCC frames, which is not what you want. Just in case you are unfamiliar with the format (just linking because I saw you put the features of an audio file on the same line as @data), here is an example where each line represents one iris plant:
% 1. Title: Iris Plants Database
%
% 2. Sources:
% (a) Creator: R.A. Fisher
% (b) Donor: Michael Marshall (MARSHALL%PLU#io.arc.nasa.gov)
% (c) Date: July, 1988
%
@RELATION iris
@ATTRIBUTE sepallength NUMERIC
@ATTRIBUTE sepalwidth NUMERIC
@ATTRIBUTE petallength NUMERIC
@ATTRIBUTE petalwidth NUMERIC
@ATTRIBUTE class {Iris-setosa,Iris-versicolor,Iris-virginica}
@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
In terms of addressing your question on what attributes you should use for your audio files, it sounds (no pun intended) like using the MFCC coefficients could work, assuming every audio file has the same number of MFCCs, because every data instance/audio file must have the same number of attributes. I would try it out and see how it goes.
EDIT:
If the audio files are not the same size you could:
Cut audio files longer than the shortest one down to that length. Basically you'd be throwing away the data at the end of the longer audio files.
Make the number of attributes large enough to fit the longest audio file, and fill the unused attributes of files shorter than the longest one with whatever MFCC coefficients represent silence.
If MFCC values are always within a certain range (e.g. -10 to 10 or something like that), then maybe use a "bag of words" model. Your attributes would represent the number of times an MFCC coefficient falls within a certain range for an audio file. So the first attribute might count the coefficients which fall between -10 and -9.95, the second attribute those between -9.95 and -9.90, and so on. If you had a very short audio file with only two coefficients (not likely, just for example purposes), one equal to 10 and the other to -9.93, then your last attribute would have a value of 1, your second attribute would have a value of 1, and all other attributes would have a value of 0. The downside to this method is that the order of the MFCC coefficients is not taken into account. However, this approach works well for text classification even though word order is ignored, so who knows, maybe it would work for audio (a small sketch of this binning follows the list).
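Assuming the MFCC frames for one file are already in a NumPy array of shape (n_frames, n_mfcc) and that the coefficients stay within -10 to 10 (both assumptions, not facts about your data), the binning could look like this:

import numpy as np

def mfcc_bag_of_words(mfcc_frames, low=-10.0, high=10.0, bin_width=0.05):
    # Count how many coefficients fall into each small value range.
    edges = np.arange(low, high + bin_width, bin_width)
    counts, _ = np.histogram(mfcc_frames.ravel(), bins=edges)
    return counts  # fixed-length attribute vector, regardless of audio length

Because the number of bins is fixed, every audio file produces the same number of attributes no matter how long it is.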
Other than that I would see if you get any good answers on your merging question.
Related
I'm quite new to AI and I'm currently developing a model for non-parallel voice conversion. One confusing problem that I have is the use of vocoders.
So my model needs Mel spectrograms as input, and the current model that I'm working on uses the MelGAN vocoder (Github link), which can generate 22050 Hz Mel spectrograms from raw wav files (which is what I need) and back. I recently tried the WaveGlow Vocoder (PyPI link), which can also generate Mel spectrograms from raw wav files and back.
But in other models, such as WaveRNN, VocGAN, and WaveGrad, there's no clear explanation of how the wav-to-Mel-spectrogram generation is done. Do most of these models not require the wav-to-Mel-spectrogram feature because they largely cater to TTS models like Tacotron? Or is it possible that all of them have that feature and I'm just not aware of it?
A clarification would be highly appreciated.
How neural vocoders handle audio -> mel
Check e.g. this part of the MelGAN code: https://github.com/descriptinc/melgan-neurips/blob/master/mel2wav/modules.py#L26
Specifically, the Audio2Mel module simply uses standard methods to create log-magnitude mel spectrograms like this (a short librosa-based sketch follows the list):
Compute the STFT by applying the Fourier transform to windows of the input audio,
Take the magnitude of the resulting complex spectrogram,
Multiply the magnitude spectrogram by a mel filter matrix. Note that they actually get this matrix from librosa!
Take the logarithm of the resulting mel spectrogram.
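A rough librosa-based version of those four steps (the parameter values here are illustrative placeholders, not the ones MelGAN actually uses):

import numpy as np
import librosa

y, sr = librosa.load("audio.wav", sr=22050)                     # hypothetical input file
stft = librosa.stft(y, n_fft=1024, hop_length=256)              # 1. STFT over windows of the audio
magnitude = np.abs(stft)                                        # 2. magnitude of the complex spectrogram
mel_basis = librosa.filters.mel(sr=sr, n_fft=1024, n_mels=80)   # 3. mel filter matrix from librosa
mel_spec = mel_basis @ magnitude
log_mel = np.log(np.clip(mel_spec, 1e-5, None))                 # 4. log of the mel spectrogram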
Regarding the confusion
Your confusion might stem from the fact that, usually, authors of Deep Learning papers only mean their mel-to-audio "decoder" when they talk about "vocoders" -- the audio-to-mel part is always more or less the same. I say this might be confusing since, to my understanding, the classical meaning of the term "vocoder" includes both an encoder and a decoder.
Unfortunately, these methods will not always work exactly in the same manner as there are e.g. different methods to create the mel filter matrix, different padding conventions etc.
For example, librosa.stft has a center argument that will pad the audio before applying the STFT, while tensorflow.signal.stft does not have this (it would require manual padding beforehand).
An example for the different methods to create mel filters would be the htk argument in librosa.filters.mel, which switches between the "HTK" method and "Slaney". Again taking Tensorflow as an example, tf.signal.linear_to_mel_weight_matrix does not support this argument and always uses the HTK method. Unfortunately, I am not familiar with torchaudio, so I don't know if you need to be careful there, as well.
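As a quick illustration of the mel-filter point (an illustrative check, not code from any of the repositories mentioned), the two conventions produce different matrices even with identical parameters:

import librosa

mel_slaney = librosa.filters.mel(sr=22050, n_fft=1024, n_mels=80, htk=False)
mel_htk = librosa.filters.mel(sr=22050, n_fft=1024, n_mels=80, htk=True)
print(abs(mel_slaney - mel_htk).max())  # nonzero: the filter banks do not match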
Finally, there are of course many parameters such as the STFT window size, hop length, the frequencies covered by the mel filters etc, and changing these relative to what a reference implementation used may impact your results. Since different code repositories likely use slightly different parameters, I suppose the answer to your question "will every method do the operation(to create a mel spectrogram) in the same manner?" is "not really". At the end of the day, you will have to settle for one set of parameters either way...
Bonus: Why are these all only decoders and the encoder is always the same?
The direction Mel -> Audio is hard. Not even Mel -> ("normal") spectrogram is well-defined since the conversion to mel spectrum is lossy and cannot be inverted. Finally, converting a spectrogram to audio is difficult since the phase needs to be estimated. You may be familiar with methods like Griffin-Lim (again, librosa has it so you can try it out). These produce noisy, low-quality audio. So the research focuses on improving this process using powerful models.
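If you want to hear how the classical approach sounds, something like this works (a rough sketch; the file names and default parameters are placeholders):

import numpy as np
import librosa
import soundfile as sf

y, sr = librosa.load("audio.wav", sr=None)
magnitude = np.abs(librosa.stft(y))       # keep the magnitude only, throw away the phase
y_rec = librosa.griffinlim(magnitude)     # estimate the phase and invert the STFT
sf.write("reconstructed.wav", y_rec, sr)  # typically sounds noticeably noisy/metallic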
On the other hand, Audio -> Mel is simple, well-defined and fast. There is no need to define "custom encoders".
Now, a whole different question is whether mel spectrograms are a "good" encoding. Using methods like variational autoencoders, you could perhaps find better (e.g. more compact, less lossy) audio encodings. These would include custom encoders and decoders and you would not get away with standard librosa functions...
I am doing a project in which I want to embed images into a .wav file so that when one sees the spectrogram using certain parameters, they will see the hidden image. My question is, in C++, how can I use the data in a wav file to display a spectrogram without using any signal processing libraries?
An explanation of the math (especially the Hanning window) will also be of great help, as I am fairly new to signal processing. Also, since this is a very broad question, detailed steps are preferable over actual code.
Example (figure): output spectrogram above, input audio waveform (.wav file) below.
Some of the steps (write C code for each; a NumPy sketch of the same math follows the list):
Convert the data into a numeric sample array.
Chop sample array into some size of chunks, (usually) overlapped.
(usually) Window with some window function.
FFT each chunk.
Take the Magnitude.
(usually) Take the Log.
Assemble all the 1D FFT result vectors into a 2D matrix.
Scale.
Color the matrix.
Render the 2D bitmap.
(optional) (optimize by rolling some of the above into a loop.)
Add plot decorations (scale, grid marks, etc.)
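Not the requested C code, but a minimal NumPy sketch of the math in those steps (Hanning window, overlapped chunks, FFT magnitude, log); a C++ version would follow the same structure with its own FFT routine:

import numpy as np

def spectrogram(samples, frame_size=1024, hop=512):
    window = np.hanning(frame_size)                          # Hanning window tapers each chunk
    columns = []
    for start in range(0, len(samples) - frame_size, hop):   # overlapped chunks
        chunk = samples[start:start + frame_size] * window
        spectrum = np.fft.rfft(chunk)                        # FFT of the windowed chunk
        columns.append(np.log1p(np.abs(spectrum)))           # magnitude, then log
    return np.array(columns).T                               # 2D matrix: frequency rows x time columns

Scaling that matrix to a 0-255 range and mapping each value to a color gives the bitmap to render.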
I am working on a project that requires extracting MFCC features from an audio stream. The project consists primarily of classification, although in the interest of expanding our dataset I am working on a detection algorithm to isolate the parts of the sound we are interested in classifying.
I am testing out different representations, and due to the nature of the data (I wish I could give more details, but I am fairly sure the professor I am working with would prefer to keep it private), I would imagine delta coefficients on top of the MFCC coefficients would be helpful.
I am extracting 40 MFCC coefficients along with 40 delta coefficients and using those for detection. I have a set of training data that consists of a 40 millisecond window centered around the parts of the audio stream I am interested in. I am then training a GMM on that data.
For testing (and its actual use case), I split a longer audio stream (2 seconds or so) into a sequence of MFCC frames. I extract the log likelihood for each frame and threshold the detection based on percentiles of the log likelihood scores, and I get strange results when delta coefficients are used.
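A minimal sketch of this kind of frame-level GMM scoring (assuming librosa and scikit-learn; the feature counts and threshold here are placeholders rather than the exact configuration described above):

import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def frame_features(y, sr, n_mfcc=40):
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    delta = librosa.feature.delta(mfcc)        # delta coefficients over time
    return np.vstack([mfcc, delta]).T          # shape: (n_frames, 2 * n_mfcc)

gmm = GaussianMixture(n_components=8, covariance_type="diag")
# gmm.fit(training_frames)                                    # frames from the 40 ms training windows
# scores = gmm.score_samples(frame_features(test_audio, sr))  # per-frame log-likelihood
# threshold = np.percentile(scores, 90)                       # percentile-based detection threshold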
You can ignore the 4 figures on the bottom, those were just for visualizing my threshold scheme.
What I want to know is why does the log likelihood behave so strangely when using delta coefficients compared to when no deltas are used?
Thank you in advance, if you need clarifications please ask.
Look at the amplitudes of your signal. The delta-coefficients example is suspiciously low compared to the non-delta one. Maybe it's just noise?
Try to run the system with and without delta on exactly the same recording. It'll be easier to debug.
You could also attach spectrogram-like visualization of your MFCC with delta.
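For example, something like this produces that kind of visualization (a sketch assuming librosa and matplotlib; the file name is a placeholder):

import librosa
import librosa.display
import matplotlib.pyplot as plt

y, sr = librosa.load("recording.wav", sr=None)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)
delta = librosa.feature.delta(mfcc)

fig, axes = plt.subplots(2, 1, sharex=True)
librosa.display.specshow(mfcc, x_axis="time", ax=axes[0])
axes[0].set(title="MFCC")
librosa.display.specshow(delta, x_axis="time", ax=axes[1])
axes[1].set(title="Delta coefficients")
plt.show()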
Good afternoon,
Well, I want to perform multi-label text classification, so I chose MEKA (an extension of Weka) for this task. However, I need to transform each document into a vector of words. I used the Weka GUI, but as you know it performs just a binary classification, which is why I intend to use MEKA for this task.
The problem is how to create an ARFF file with multiple labels.
Here is an example:
This is the text:
The addition of FMNH(2) to Vibrio harveyi luciferase at 2°C in the presence of tetradecanal results in the formation of a highly fluorescent transient species with a spectral distribution indistinguishable from that of the bioluminescence. The bioluminescence reaches maximum intensity in 1.5 s and decays in a complex manner with exponential components of 10^(-1) s^(-1), 7 x 10^(-3) s^(-1), and 7 x 10^(4) s^(-1).
The labels are:
"FM", "Fl", "Ki", "Luc", "Lum", "Time Factors"
The result I want to get:
@attribute L-class {Luc, Lum, Limb, ...}
@attribute F-class {FM, Fl, Foot, ...}
@attribute o-class {Ki, TimeFactors, Adult, Aged, ...}
@attribute All_words frequency
@data
FM,Fl,Ki,Luc,Lum,TimeFactors,2,4,6,8,8,7,4,0,1,2,2....
The acronyms are the labels, and the numbers are the frequency of each term occurring in the text.
If someone could help me, I would be really thankful.
This is probably a very silly question, but I couldn't find details anywhere.
So I have an audio recording (wav file) that is 3 seconds long. That is my sample and it needs to be classified as [class_A] or [class_B].
By following some tutorial on MFCC, I divided the sample into frames (291 frames to be exact) and I've gotten MFCCs from each frame.
Now I have 291 feature vectors, the length of each vector is 13.
My question is: how exactly do you use those vectors with a classifier (k-NN, for example)? I have 291 vectors that represent 1 sample. I know how to work with 1 vector for 1 sample, but I don't know what to do if I have 291 of them. I couldn't really find an explanation anywhere.
Each of your vectors represents the spectral characteristics of your audio file as it varies in time. Depending on the length of your frames, you might want to group some of them (for example by averaging per dimension) to match the resolution with which you want the classifier to work. As an example, think of a particular sound that might have an envelope with an attack time of 2 ms: that may be as fine-grained as you want to get with your time quantization, so you could a) group and average the MFCC vectors that cover 2 ms; or b) recompute the MFCCs with the desired time resolution.
If you really want to keep the resolution that fine, you can concatenate the 291 vectors and treat the result as a single vector (of 291 x 13 dimensions), which will probably need a huge dataset to train on.
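For illustration, both options could look roughly like this with scikit-learn's k-NN (a sketch that assumes each sample's 291 x 13 MFCC matrix is already computed; the variable names are placeholders):

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def features(mfcc_frames, mode="average"):
    if mode == "average":
        return mfcc_frames.mean(axis=0)   # one 13-dimensional vector per sample
    return mfcc_frames.reshape(-1)        # concatenation: 291 * 13 = 3783 dimensions

# X = np.array([features(m) for m in mfcc_per_sample])   # mfcc_per_sample: list of (291, 13) arrays
# knn = KNeighborsClassifier(n_neighbors=3).fit(X, labels)
# prediction = knn.predict([features(new_sample_mfcc)])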