Each WAV file has a sampling rate and a bit depth. The former governs how many samples are played per second, and the latter governs how many possible values each sample can take.
If the sampling rate is, for example, 1000 Hz and the bit depth is 8, then every 1/1000 of a second the audio device plays one of $2^8 = 256$ possible values.
Hence the bulk of the WAV file is a sequence of 8-bit numbers. There is also a header which contains the sampling rate, bit depth, and other specifics of how the data should be read:
The above comes from running xxd on a WAV file to view it in binary on the terminal. The first column is just offsets, incrementing by 16 (0x10) in hexadecimal. The last column seems to show where the header ends. So the data looks like this:
Each of those 8-bit numbers is a sample. So the device reads left to right and converts the samples in order into sounds. But how, in principle, can each number correspond to a sound? I would think each number should somehow encode an amplitude and a pitch, each coming from a finite range. But I cannot find any reference to, for example, the first half of the bits encoding an amplitude and the second half a frequency.
I have found references to the numbers encoding "signal strength", but I do not know what this means. Can anyone explain in principle how the data is read and converted to audio?
In your example, over the course of a second, 1000 values are sent to a DAC (digital-to-analog converter), where the discrete values are smoothed out into a waveform. Each value is an instantaneous amplitude; the pitch is determined by the rate and pattern at which the stream of values (which gets smoothed out into a wave) rises and falls.
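To make "signal strength" concrete, here is a minimal sketch using only Python's standard library (the 440 Hz tone and 44100 Hz rate are illustrative choices, not from the question): every byte written is an instantaneous amplitude, and the pitch comes purely from how quickly those amplitudes rise and fall.

```python
import math
import wave

SAMPLE_RATE = 44100   # samples per second
FREQ = 440.0          # desired pitch, in Hz

# Build one second of a sine wave as 8-bit samples. Each sample is just the
# amplitude of the wave at that instant; no frequency is stored anywhere.
frames = bytearray()
for n in range(SAMPLE_RATE):
    t = n / SAMPLE_RATE                            # time of this sample
    amplitude = math.sin(2 * math.pi * FREQ * t)   # in [-1.0, 1.0]
    # 8-bit WAV samples are unsigned: map [-1, 1] to [0, 255]; 128 is silence
    frames.append(int((amplitude + 1) / 2 * 255))

with wave.open("tone.wav", "wb") as w:
    w.setnchannels(1)              # mono
    w.setsampwidth(1)              # bit depth 8 = 1 byte per sample
    w.setframerate(SAMPLE_RATE)
    w.writeframes(bytes(frames))   # the bulk of the file: raw amplitudes
```

Change FREQ and only the rate of oscillation in the byte stream changes; that is the entire encoding of pitch.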
Steven W. Smith gives some good diagrams and explanations in the ADC and DAC chapter of his very helpful book The Scientist and Engineer's Guide to Digital Signal Processing.
I need to take short sound samples every 5 seconds, and then upload these to our cloud server.
I then need to find a way to compare / check if that sample is part of a full long audio file.
The samples will be recorded from a phones microphone, so they will indeed not be exact.
I know this topic can get quite technical and complex, but I am sure there must be some libraries or online services that can assist in this complex audio matching / pairing.
One idea was to use an audio-to-text conversion service and then do matching based on the actual dialog. However, this does not feel efficient to me, whereas matching based on actual sound frequencies or patterns would be a lot more efficient.
I know there are services out there, such as Shazam, that do this type of audio matching. However, I would imagine their services are all proprietary.
Some factors that could influence it:
Both audio samples will be timestamped, so we do not have to search through the entire sound clip.
To gain traction on getting an answer, you need to focus on an answerable question where you have done battle, and show your code.
Off the top of my head, I would walk across the audio to pluck out a bucket of several samples ... then slide your bucket across several samples and perform another bucket-pluck operation ... allow each bucket to contain overlap samples also contained in the previous bucket as well as the next bucket ... fewer samples means quicker computation, more samples greater accuracy, to an extent, YMMV
... feed each bucket into a Fourier transform to render the time-domain input audio into its frequency-domain counterpart ... record into a database salient attributes of the FFT of each bucket, like the X frequencies having the most energy (greatest magnitude in your FFT)
... also perhaps store the standard deviation of those top X frequencies with respect to their energy (how dispersed those frequencies are) ... define additional such attributes as needed ... for such a frequency-domain approach to work you need relatively few samples in each bucket, since an FFT works on periodic time-series data, so if you feed it 500 milliseconds of complex audio like speech or music you no longer have periodic audio; instead you have mush
Then, once all existing audio has been sent through the above processing, do the same to your live new audio, then identify which prior audio contains the sequence of buckets most similar to your current audio input ... use a Bayesian approach so your guesses have probabilistic weights attached, which lend themselves to real-time updates
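Here is a rough sketch of that bucket-plus-FFT idea, assuming NumPy and mono PCM samples already loaded into a float array (the function name and parameter defaults are made up for illustration):

```python
import numpy as np

def bucket_fingerprints(samples, rate, bucket_size=1024, hop=512, top_x=5):
    """Slide an overlapping window ("bucket") across the audio, FFT each
    bucket, and keep the top-X frequencies by energy plus their spread."""
    window = np.hanning(bucket_size)       # taper to reduce spectral leakage
    freqs = np.fft.rfftfreq(bucket_size, d=1.0 / rate)
    fingerprints = []
    for start in range(0, len(samples) - bucket_size, hop):
        bucket = samples[start:start + bucket_size] * window
        magnitude = np.abs(np.fft.rfft(bucket))
        top = np.argsort(magnitude)[-top_x:]          # X strongest bins
        fingerprints.append({
            "peak_freqs": sorted(freqs[top]),         # dominant frequencies
            "spread": float(np.std(freqs[top])),      # how dispersed they are
        })
    return fingerprints
```

Matching then reduces to finding the prior recording whose sequence of bucket attributes best lines up with the live input's sequence.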
Sounds like a very cool project, good luck ... here are some audio fingerprint resources:
does audio clip A appear in audio file B
Detecting audio inside audio [Audio Recognition]
Detecting a specific pattern from a FFT in Arduino
Audio Fingerprinting using the AudioContext API
https://news.ycombinator.com/item?id=21436414
https://iq.opengenus.org/audio-fingerprinting/
Chromaprint is the core component of the AcoustID project.
It's a client-side library that implements a custom algorithm for extracting fingerprints from any audio source
https://acoustid.org/chromaprint
Audio landmark fingerprinting as a Node Stream module - nodejs converts a PCM audio signal into a series of audio fingerprints.
https://github.com/adblockradio/stream-audio-fingerprint
SO followup
How to compare / match two non-identical sound clips
Audio fingerprinting and recognition in Python
https://github.com/worldveil/dejavu
Audio Fingerprinting with Python and Numpy
http://willdrevo.com/fingerprinting-and-audio-recognition-with-python/
MusicBrainz: an open music encyclopedia (musicbrainz.org)
https://news.ycombinator.com/item?id=14478515
https://acoustid.org/chromaprint
How does Chromaprint work?
https://oxygene.sk/2011/01/how-does-chromaprint-work/
https://acoustid.org/
MusicBrainz is an open music encyclopedia that collects music metadata and makes it available to the public.
https://musicbrainz.org/
Audio Matching (Audio Fingerprinting)
Is it possible to compare two similar songs given their wav files?
audio hash
https://en.wikipedia.org/wiki/Hash_function#Finding_similar_records
audio fingerprint
https://encrypted.google.com/search?hl=en&pws=0&q=python+audio+fingerprinting
ACRCloud
https://www.acrcloud.com/
How to recognize a music sample using Python and Gracenote?
I have tried the Watson Speech to Text API with MP3 as well as WAV files. From my observation, the same length of audio takes less time if it's given in MP3 format than in WAV. Ten consecutive API calls with different audios took on average 8.7 seconds for MP3 files; the same input in WAV format took on average 11.1 seconds. Does the service response time depend on the file type? Which file type is recommended to obtain results faster?
Different encoding formats have different bitrates. MP3 and Opus are lossy compression formats (although suitable for speech recognition when bitrates are not too low), so they offer the lowest bitrates. Pushing fewer bytes over the network is typically better for latency, so depending on your network speed you can see shorter processing times when using an encoding with a lower bitrate.
However, regarding the actual speech recognition process (ignoring the data transfer over the network), all encodings are equally fast, since before recognition starts all the audio is decompressed, if necessary, and converted to the sampling rate of the target model (broadband or narrowband).
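So if transfer time dominates, it can pay to compress before uploading. A hedged sketch, assuming ffmpeg built with libopus is on the PATH (the filenames are placeholders, and this is not a Watson API call):

```python
import subprocess

# Shrink a WAV to Ogg/Opus before uploading, to cut network transfer time.
subprocess.run(
    ["ffmpeg", "-y",
     "-i", "speech.wav",    # uncompressed input (placeholder name)
     "-c:a", "libopus",     # lossy codec that holds up well for speech
     "-b:a", "24k",         # low bitrate, still adequate for recognition
     "speech.ogg"],
    check=True,             # raise if ffmpeg fails
)
```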
An MP3 file header contains only the sample rate and bit rate, so the decoder can't figure out the bit depth from the header. Maybe it can only guess from the bit rate? But the bit rate varies from frame to frame.
Here is another way to ask this question: if I encode a 24-bit WAV to MP3, how is the 24-bit info stored in the MP3?
When the source WAV is compressed, the original bit depth information is "thrown away". This is by design in any compressed audio codec, since the whole point is to use the fewest bits possible to store the "same" audio.
Internally, MP3 uses Huffman symbols to store the processed audio data. As such, there's no real "bit depth" to report.
During the encoding process, the samples are quantized, so the original bit depth information is lost.
MP3 decoders either choose a bit depth to operate at, or allow the end user/application to dictate it. The bit depth is determined during "re-quantization".
Have a read of http://blog.bjrn.se/2008/10/lets-build-mp3-decoder.html, which is rather enlightening.
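To see that the output bit depth is the decoder's (or caller's) choice rather than a property of the MP3, here is a small sketch assuming ffmpeg is installed (filenames are placeholders): the same file decoded to both 16-bit and 24-bit PCM.

```python
import subprocess

# Decode the same MP3 at two different bit depths; both are valid, because
# the bit depth is picked at re-quantization time, not stored in the MP3.
for codec, out in [("pcm_s16le", "decoded_16bit.wav"),
                   ("pcm_s24le", "decoded_24bit.wav")]:
    subprocess.run(["ffmpeg", "-y", "-i", "input.mp3", "-c:a", codec, out],
                   check=True)
```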
Some years ago I made a music audio recording, and I can't find the original WAV files, I have only compressed MP3s. Now I found an audio CD, but I don't know if it was made using the original, uncompressed WAVs, or if it was made from compressed MP3 or OGG files.
Is there a way to detect whether an audio sample has been compressed and decompressed using lossy compression such as MP3, OGG, ..., without having the original to compare to?
Update:
Trying @MisterHenson's suggestion, I plotted the spectra of the two samples, with obvious differences in the graphs:
The sample from the CD:
The sample from the MP3:
This practically solves my current problem, but I still have these open questions:
If the spectra were visually indistinguishable, I wouldn't know if there is a real difference, or that I just can't distinguish them (i.e. the compression would be of better quality). What else could I try?
Similarly what would I do if I didn't have the MP3 file to compare to, just a single audio sample?
Is there an automated method, that'd answer the question with a reasonable probability?
I made an example to stress the topology of all MP3 transcodes, the source material being a Chopin nocturne: MP3 on top, lossless on bottom. All recordings have background noise of some amplitude, and that noise is faintly visible here. What the MP3 transcode (LAME's V2 preset in this case) does is create a hard limit at ~16 kHz. On a 320 kbps, 44.1 kHz sample rate MP3, this hard limit appears at around 20 kHz, but it would still be visibly different in this image.
You can pick out this shelf without having the original lossless file for comparison. I'm willing to say all music has amplitude at frequencies above even 19 kHz. Here's an example for which I do not have the lossless source file, just a 320 kbps MP3. You can see the very hard limit at 20 kHz as well as a milder cutoff at 19 kHz. Were it lossless, that red blob in the middle would extend all the way up to 22 kHz, since the sample rate is 44.1 kHz.
I would say this process is probably automatable, but I do not know of any attempts to automate it. If it were automated, though, I'd say it could pick lossy from lossless with much higher accuracy than you or I, by virtue of being able to analyze the entire spectrum as opposed to just the high-frequency cutoffs.
Full res images:
http://i.imgur.com/dezONol.jpg
http://i.imgur.com/1qokxAN.jpg
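If you want to inspect a file for this shelf yourself, here is a minimal sketch assuming SciPy and Matplotlib, with "suspect.wav" as a placeholder filename:

```python
import matplotlib.pyplot as plt
from scipy.io import wavfile

rate, samples = wavfile.read("suspect.wav")   # placeholder filename
if samples.ndim > 1:
    samples = samples[:, 0]                   # inspect the left channel only

# A long FFT window gives enough frequency resolution to see the shelf.
plt.specgram(samples, NFFT=4096, Fs=rate, noverlap=2048)
plt.xlabel("Time (s)")
plt.ylabel("Frequency (Hz)")
plt.title("Look for a hard horizontal shelf around 16-20 kHz")
plt.show()
```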
The above approaches sound very promising, although maybe a little complicated; you might first try something easy, like checking the distribution of the least significant bit (LSB). In a natural sample, the LSB should have an almost exact 50/50 distribution between zeroes and ones (across many samples it would actually have some variance following a binomial distribution, but with millions or billions of bits this will be ridiculously close to 50/50 in any given sample). In a lossy sample, you will find an unlikely distribution in the LSB.
Something like this (sketched in code after the steps):
1 -- extract LSB from each data point
2 -- apply chi-squared test to judge if distribution is unusual
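A minimal sketch of those two steps, assuming 16-bit integer PCM read with SciPy ("suspect.wav" is a placeholder, and how reliably this fires on real codec output is untested here):

```python
import numpy as np
from scipy.io import wavfile
from scipy.stats import chisquare

rate, samples = wavfile.read("suspect.wav")   # placeholder filename

# 1 -- extract the least significant bit from each data point
lsb = samples.flatten() & 1
ones = int(lsb.sum())
zeros = int(lsb.size - ones)

# 2 -- chi-squared test against the expected 50/50 split
stat, p = chisquare([zeros, ones])
print(f"zeros={zeros} ones={ones} p-value={p:.4f}")
print("unlikely distribution -- possibly lossy" if p < 0.01 else "looks natural")
```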
Here is the deal.
A raw sample (or a raw piece of sound) is encoded at a certain quality.
Some sound cards can go further, with 64-bit sampling.
But let's assume that we have sound files of a certain KNOWN quality.
CD quality is okay for the human ear.
A studio, though, would make use of higher-quality samples, like 24-bit as a standard.
So you have a waveform filename.wav that really has a sample rate of 44100 Hz.
What does that mean?
It means the computer takes a huge number of samples per second to represent almost the exact sound.
Is the sound original? That depends on how it was made.
If it was made by your computer and a piece of software using a default 16-bit sound card, yes it is.
If it came from an analogue recording, though, it loses some of its quality in the digitization at 44100 Hz, fortunately not so significantly for the human ear.
NOTE THAT MP3 recording is a bad idea for professional recording.
But since MP3 recordings do exist... this adds complexity to your question. :P
So some sound quality is lost in digitization with a 16-bit sound card.
Now a similar thing happens when you encode something to MP3.
Check out your picture: above 17000 Hz there is no sound. It was butchered to make the sound file significantly smaller, without doing any significant damage to the audio quality. Is it the same piece of sound? No. It sounds the same, though. But a sound engineer LOVES original, good-quality samples, because of the information that is NOT cut.
Imagine me making an original sound so balanced and compressed that even after MP3 conversion it is hard to tell whether it is the original sound or not. Imagine me using equalizers to cut any sharp edges, and gate effects to extremely normalize it. Also, my sound generators are some 8-bit oscillators passing through some FX and filters.
If I convert it back to WAV, there might be no difference.
For instance:

[UNCHANGED FREQUENCIES][CUT FREQUENCIES]
Waveform: =================================
mp3:      =======================
Waveform: =======================

[UNCHANGED FREQUENCIES][CUT FREQUENCIES]
Waveform: =================
mp3:      =================
Waveform: =================

The following seems impossible to me (except if the converter has bugs, which could be heard):

[UNCHANGED FREQUENCIES][CUT FREQUENCIES]
Waveform: =========================
mp3:      =======================
Waveform: =============================
So your question depends on the original source used in the first waveform.
The good news is that a sample is RARELY THAT limited and compressed.
So it seems to me that the CD you used will probably sound like the original waveform, while, as you can see, the mp3 has frequencies cut out.
To be sure, of course, you need a frequency analyzer and spectrum, as MischaNix has already shown.
There are many mp3 encodings too. Some are static, some dynamic; some cut more and some cut less sound information. Some are also bigger than others for that reason.
Now there are lossless formats too.
And then there is OGG, which is small enough and also has great quality.
So this question can become a huge topic for no reason here. I will not talk about all of these.
If the issue is identifying an original sample, your pictures show me significant differences between the two samples. I mean, a waveform made out of the cut mp3 variation should look like that cut variation. You cannot get information out of nothing.
Burn the mp3 to a CD, then rip the wave and compare the new waveform with the old one and with the mp3 waveform. They will probably not be the same, so you might hit the jackpot here: it is possible you have an original backup on your hands.
From now on, though, try sampling raw material and storing it on a CD or DVD before discarding it.
Or at least keep good uncompressed samples in a backup.
Open questions:
If the spectra were visually indistinguishable, I wouldn't know if there is a real difference, or that I just can't distinguish them.
Correct. But this would seldom occur without intention during sampling.
Why ask such a question? :) Do you have steganography in mind?
If yes, make sure to keep in mind the nature of the piece of sound you are going to use. Samples are not appropriate; "finished songs" are!
Similarly what would I do if I didn't have the MP3 file to compare to, just a single audio sample?
Since there are many mp3 encoding settings of different qualities, you can check whether the lowest quality was used. If not, there is uncertainty because of the compression capabilities. If this applies to the whole sample, then you have to see whether compression was needed. That's why you cannot be certain about a song: you don't record with SUCH hard compression in the first place. I guess this is another meta-reason why you need a natural sound. So if it's about a recording, you might be lucky.
Now, about a finished, mastered song... things get rough once again. It is about the nature, the type, of the sound. With a recording it is easier to figure out what is going on if you know it was recorded straight to waveform. An mp3 recording, of course, is a waste of time. On the other hand, a finished song nowadays usually makes compressors, limiters, gates and chained compressors burn out. The amount of use of these techniques in modern mastering is enormous. So... you will really need luck to find out whether the original piece was compressed before you even have an original waveform to begin with.
Is there an automated method, that'd answer the question with a reasonable probability?
None that I know. Sorry. :(
But that doesn't mean that nobody can make one.
BUT!
A stereo sample is usually split into two channels, left and right.
Now, if you have a spectrum analyzer in a digital audio workstation and look only at the left channels of two different samples, you can see on the fly whether they are the same or not, I guess.
In order to understand what I mean, take a look at THIS link.
Go to 05:00 and just watch the interface.
Phew. Hope this helps you further, since it took some time. :P
Cheers.
Edit: Fixing some stuff here and there.
I found a description of the problem, a solution, and an implementation in Python by Maurits van der Schee, though it works with FLAC.
From the sample only the first 30 seconds are analyzed. For every
second the frequency spectrum of the sample is computed by applying a
Hanning Window and doing a Fast Fourier Transform. These spectrums are
added, so that eventually you end up with 30 stacked spectrums. These
are divided by 30 to get the average spectrum. Then the spectrum is
normalized using log10. After that we applied a rolling average on the
spectrum with a window size of 1/100th of the frequency, being
44100/100=441 samples.
If there is an unnatural cutoff in the frequency spectrum, this cutoff
is the thing we need to find. We sweep the spectrum from 44100th back
to the 1st frequency, where the variable frequency is f. As soon as
the magnitude at f-220 is more than 1.25 higher than the magnitude at
f and the magnitude at f is no bigger than 1.1x the magnitude at 44100
we have found the cutoff point. The cutoff point is multiplied by 100
and divided by the frequency to get to the percentage of the spectrum
not cut off.
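Here is a hedged Python sketch of that sweep, assuming `spectrum` already holds the averaged, log10-normalized, rolling-averaged magnitudes with one entry per frequency up to 44100 (whether "1.25 higher" means a ratio or a difference is my reading of the quote; a ratio is used below):

```python
def find_cutoff(spectrum):
    """Sweep from the top of `spectrum` back down and return the percentage
    of the spectrum that is not cut off, per the description above."""
    top = len(spectrum) - 1                             # the 44100th frequency
    for f in range(top, 220, -1):
        drop = spectrum[f - 220] > 1.25 * spectrum[f]   # sharp fall at f
        floor = spectrum[f] <= 1.1 * spectrum[top]      # f sits near the floor
        if drop and floor:
            return f * 100.0 / len(spectrum)            # percent not cut off
    return 100.0                                        # no unnatural cutoff
```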
Things to look for:
Cut-off frequency changing on frame boundaries (not going to be a 100% hard cut, but look for "audible" to "inaudible" and vice versa)
Frequencies disappearing or appearing on frame boundaries (again, not 100%)
Noise levels changing on frame boundaries (actually pretty solid for lossy codecs)
For MP3, the frame boundaries are precisely every 1152 samples, though you might be able to "see" the granules every 576 samples.
For Vorbis, the frame boundaries are typically every 128 or 1024 samples, depending on the transients the encoder "saw". You can probably get away with checking every 128 samples...
You'll have to research the other formats to know their frame sizes (I don't know them offhand).
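As a starting point for the noise-level check, here is a rough sketch assuming NumPy/SciPy and a decoded mono WAV ("decoded.wav" is a placeholder; a real detector would also compare jumps at boundary-aligned offsets against misaligned ones, whereas this only flags abrupt high-band energy changes on MP3's 1152-sample frame grid):

```python
import numpy as np
from scipy.io import wavfile

MP3_FRAME = 1152                                  # MP3 frame size in samples

rate, samples = wavfile.read("decoded.wav")       # placeholder filename
samples = samples.astype(np.float64)
if samples.ndim > 1:
    samples = samples[:, 0]

# High-band energy of each frame-aligned block.
window = np.hanning(MP3_FRAME)
energies = []
for i in range(len(samples) // MP3_FRAME):
    frame = samples[i * MP3_FRAME:(i + 1) * MP3_FRAME] * window
    spectrum = np.abs(np.fft.rfft(frame))
    energies.append(spectrum[len(spectrum) // 2:].sum())  # top half of band

# Flag frame-to-frame jumps far larger than the typical variation.
diffs = np.abs(np.diff(energies))
threshold = 5 * np.median(diffs) + 1e-12
print(f"{int((diffs > threshold).sum())} suspicious jumps "
      f"out of {len(diffs)} frame boundaries")
```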