I have MFCC (Mel-frequency cepstral coefficient) files generated by HTK from .wav files. What I need is to extract a time span from the MFCC file: when the file represents 90 minutes of audio, I want to get, e.g., the MFCCs for the third minute of the audio.
The HTK book says the MFCC file consists of a header and a contiguous sequence of samples. But determining the exact size of a sample in bytes doesn't seem trivial.
Is there perhaps a parser for the files? (Of course there is, in HTK, but I didn't manage to figure out how to use the binaries for this task.)
Or is there maybe an easy way to determine the size of a sample and of the header, so that I can simply cut the file apart?
Figured it out. HTK has a tool for that. HCopy can convert MFCC to MFCC and accepts parameters for start and end.
HCopy -C config0 -s 10e7 -e 11e7 source.mfcc target.mfcc
cuts seconds 10 to 11 (00:10 .. 00:11) from source. HTK start and end times are given in units of 100 ns, so 10e7 = 10 s.
config0 should contain the same configuration that was used for creating the original MFCCs from wav, except for the source kind, which was set to wav there and must now be the MFCC kind (the file fed to HCopy is already parameterised).
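For completeness, to answer the original question about sample and header sizes directly: per the HTK Book the parameter-file header is 12 bytes, namely nSamples (4-byte int), sampPeriod (4-byte int, in 100 ns units), sampSize (2-byte int, bytes per sample vector) and parmKind (2-byte int), so the sample size can simply be read out of the file. A minimal C++ sketch, assuming the default big-endian byte order (the file name argument is just for illustration):

#include <cstdint>
#include <cstdio>

int main(int argc, char** argv) {
    if (argc < 2) { std::fprintf(stderr, "usage: %s file.mfcc\n", argv[0]); return 1; }
    std::FILE* f = std::fopen(argv[1], "rb");
    if (!f) { std::perror("fopen"); return 1; }
    unsigned char h[12];
    if (std::fread(h, 1, 12, f) != 12) { std::fprintf(stderr, "short read\n"); std::fclose(f); return 1; }
    std::fclose(f);
    // HTK writes big-endian by default, so assemble each field byte by byte.
    uint32_t nSamples   = ((uint32_t)h[0] << 24) | ((uint32_t)h[1] << 16) | ((uint32_t)h[2] << 8) | h[3];
    uint32_t sampPeriod = ((uint32_t)h[4] << 24) | ((uint32_t)h[5] << 16) | ((uint32_t)h[6] << 8) | h[7];
    uint16_t sampSize   = (uint16_t)((h[8] << 8) | h[9]);   // bytes per sample vector
    uint16_t parmKind   = (uint16_t)((h[10] << 8) | h[11]); // base kind 6 = MFCC, plus qualifier bits
    std::printf("nSamples=%u  sampPeriod=%u (x100 ns)  sampSize=%u bytes  parmKind=0x%04x\n",
                (unsigned)nSamples, (unsigned)sampPeriod, (unsigned)sampSize, (unsigned)parmKind);
    // The vector for time t starts at byte offset 12 + (t / sampPeriod) * sampSize.
    return 0;
}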
I am doing a project in which I want to embed images into a .wav file so that when one sees the spectrogram using certain parameters, they will see the hidden image. My question is, in C++, how can I use the data in a wav file to display a spectrogram without using any signal processing libraries?
An explanation of the math (especially the Hanning window) will also be of great help, I am fairly new to signal processing. Also, since this is a very broad question, detailed steps are preferable over actual code.
Example: [image omitted: output spectrogram above, input audio waveform (.wav file) below]
Some of the steps (write C code for each; a C++ sketch of the core steps follows the list):
Convert the data into a numeric sample array.
Chop the sample array into chunks of some fixed size, (usually) overlapped.
(usually) Window each chunk with some window function; the Hann window, w[n] = 0.5 * (1 - cos(2*pi*n/(N-1))), is the common choice because it tapers the chunk edges to zero and reduces spectral leakage.
FFT each chunk.
Take the magnitude of each complex FFT result.
(usually) Take the log of the magnitudes, so that quiet components remain visible.
Assemble all the 1D FFT result vectors into a 2D matrix.
Scale.
Color the matrix.
Render the 2D bitmap.
(optional) Optimize by rolling some of the above steps into a single loop.
Add plot decorations (scale, grid marks, etc.)
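To make steps 2 through 6 concrete, here is a minimal C++ sketch with no signal-processing libraries. It assumes the wav data has already been decoded into a vector of mono float samples (step 1), and it uses the naive O(N^2) DFT for clarity; a real implementation would substitute a proper FFT:

#include <cmath>
#include <cstddef>
#include <vector>

// Steps 2-6: chop, window (Hann), transform, magnitude, log.
// samples: decoded mono audio; chunkSize: window length N;
// hop: step between chunk starts (overlap = chunkSize - hop).
std::vector<std::vector<float>> spectrogram(const std::vector<float>& samples,
                                            int chunkSize, int hop)
{
    const double PI = 3.14159265358979323846;
    // Hann window: w[n] = 0.5 * (1 - cos(2*pi*n/(N-1))).
    // Tapers each chunk to zero at the edges, reducing spectral leakage.
    std::vector<double> win(chunkSize);
    for (int n = 0; n < chunkSize; ++n)
        win[n] = 0.5 * (1.0 - std::cos(2.0 * PI * n / (chunkSize - 1)));

    std::vector<std::vector<float>> columns;
    for (std::size_t start = 0; start + chunkSize <= samples.size(); start += hop) {
        std::vector<float> col(chunkSize / 2 + 1); // bins 0..N/2 suffice for real input
        for (int k = 0; k <= chunkSize / 2; ++k) {
            // Naive DFT of the windowed chunk, one bin at a time.
            double re = 0.0, im = 0.0;
            for (int n = 0; n < chunkSize; ++n) {
                double x = samples[start + n] * win[n];
                double ang = -2.0 * PI * k * n / chunkSize; // e^(-j*2*pi*k*n/N)
                re += x * std::cos(ang);
                im += x * std::sin(ang);
            }
            double mag = std::sqrt(re * re + im * im);        // magnitude
            col[k] = (float)(20.0 * std::log10(mag + 1e-10)); // log scale (dB-like)
        }
        columns.push_back(col); // one column of the 2D spectrogram matrix
    }
    return columns;
}

Each returned vector is one column of the 2D matrix from step 7; to render it, normalize the values, map each to a pixel color, and draw the columns left to right with frequency on the vertical axis.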
I have tried the Watson Speech to Text API with MP3 as well as WAV files. From my observation, audio of the same length takes less time if it is given in MP3 format than in WAV. 10 consecutive API calls with different audios took on average 8.7 seconds for MP3 files; the same input in WAV format took on average 11.1 seconds. Does the service response time depend on the file type? Which file type is recommended for obtaining results faster?
Different encoding formats have different bitrates. MP3 and Opus are lossy compression formats (although suitable for speech recognition when bitrates are not too low), so they offer the lowest bitrates. If you need to push fewer bytes over the network, that is typically better for latency, so depending on your network speed you can see shorter processing times when using encodings with lower bitrates.
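To put rough numbers on it (illustrative values, not Watson specifics): one minute of 16 kHz, 16-bit mono WAV is 16000 samples/s x 2 bytes x 60 s, which is about 1.92 MB, while the same minute encoded as 64 kbit/s MP3 is 64000/8 bytes/s x 60 s, about 0.48 MB, i.e. a quarter of the bytes to upload.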
However, as far as the actual speech recognition process is concerned (ignoring the data transfer over the network), all encodings are equally fast: before recognition starts, all the audio is decompressed, if necessary, and converted to the sampling rate of the target model (broadband or narrowband).
I am trying to convert a .wav music file into something playable with the beep command.
I need to export the frequencies to a text format to use as input parameters for beep.
PS: It is not about speech transcription.
The beep command in Linux only controls the PC speaker, which allows only one frequency at a time, so it does not fit here. A wav file is a file of samples that normally carries music, and music is made of a lot of simultaneous frequencies.
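For reference, about all beep can do is play one tone after another; with the Linux beep utility's -f (frequency in Hz), -l (length in ms) and -n (start a new tone) flags, a monophonic sequence looks like:

beep -f 440 -l 200 -n -f 494 -l 200 -n -f 523 -l 400

(A4, then B4, then C5; illustrative values, and only ever one frequency at a time.)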
You cannot convert a wav file to play it on the PC speaker. You need a sound card to do that.
As you say, it's not voice recognition, but even so: a simple violin note sounds different from a guitar note because it carries more than a single frequency. There are what are called harmonics, components at different frequencies (normally multiples of the fundamental), and both the frequencies and their relative intensities make the sounds differ. That is impossible to reproduce with a tool that plays only a single frequency with a fixed wave shape (the PC speaker's wave is not sinusoidal and already includes harmonics of its own, which is what makes it sound like a PC speaker) and no control over intensity.
Overview:
I have about 1000 MP3 files that I need to perform noise removal on.
I have used Audacity in the past for individual noise removal operations but Audacity will not cut it for this job.
Audacity is unable to perform bulk operations and I don't have the time to perform this manually on 1000s of MP3 files.
A little about the noise:
The noise is similar to white noise but it differs slightly in every MP3 file, so a different noise profile will need to be built for each MP3.
The noise comes from a fan in the background (if you were wondering).
Question:
What is the best way to automate noise removal from the MP3 files?
You could try using SoX. It's a command-line application, so it is scriptable. See here for further info.
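For example, a sketch using SoX's noiseprof and noisered effects. It assumes your SoX build has MP3 support, a cleaned/ output directory exists, and the first half-second of every file contains only the background noise; the 0.21 reduction amount is just a starting point to tune by ear:

for f in *.mp3; do
  sox "$f" -n trim 0 0.5 noiseprof "$f.prof"
  sox "$f" "cleaned/$f" noisered "$f.prof" 0.21
done

noiseprof builds the per-file noise profile the question calls for, and noisered applies it.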
I was reading THIS TUTORIAL on wav files and I have some confusions.
Suppose I use PCM_16_BIT as my encoding format. This should mean each of my sound samples needs 16 bits to represent it, shouldn't it?
But in this tutorial, the second figure shows 4 bytes as one sample. Why is that? I suppose it is because the figure shows the format of a stereo-recorded wav file, but what about a mono-recorded wav file? Are the left and right channel values equal in that case, or is one of the channel values 0? How does it work?
Yes, for 16-bit stereo you need 4 bytes per sample frame: one 16-bit sample for each channel. For mono you just need two bytes per frame for 16-bit PCM; a mono file stores only a single channel, so nothing is duplicated or zeroed. Check this out:
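In the WAV header this per-frame size is the fmt chunk's block-align field: block align = number of channels x (bits per sample / 8). For 16-bit stereo that is 2 x 2 = 4 bytes; for 16-bit mono, 1 x 2 = 2 bytes.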
http://www.codeproject.com/Articles/501521/How-to-convert-between-most-audio-formats-in-NET
Also read here:
http://wiki.multimedia.cx/index.php?title=PCM