How does an audio converter work?

I currently have the idea to code a small audio converter application (e.g. FLAC to MP3 or the M4A format) in C# or Python, but my problem is that I do not know at all how audio conversion works.
After some research, I came across analog-to-digital / digital-to-analog converters, but I guess what I need is a digital-to-digital converter or something like that, isn't it?
If someone could precisely explain how it works, it would be greatly appreciated.
Thanks.

digital audio in its raw form is called PCM (pulse-code modulation), which is the format fundamental to any audio processing system ... it's uncompressed ... just a series of integers representing the height of the audio curve at each sample of the curve (the Y axis, where time is the X axis along this curve)
... this PCM audio can be compressed using some codec and then bundled inside a container, often together with video or metadata channels ... so to convert audio from A to B you first need to understand the container spec as well as the compressed audio codec so you can decompress audio A into PCM format ... then do the reverse ... compress the PCM with the codec of B, then bundle it into the container of B
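In practice most converter tools delegate the container and codec work to an existing engine rather than reimplementing it. As a rough illustration of that decode-to-PCM-then-re-encode pipeline, here is a minimal Python sketch using the pydub library (which shells out to ffmpeg under the hood; the file names and bitrate are just placeholders):

    from pydub import AudioSegment

    # decode: container and codec of A are unpacked down to PCM in memory
    audio = AudioSegment.from_file("song.flac", format="flac")

    # encode: the PCM is compressed with codec B and wrapped in container B
    audio.export("song.mp3", format="mp3", bitrate="192k")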
Before venturing further into this I suggest you master the art of WAVE audio files ... the beauty of WAVE is that it's just a 44 byte header followed by the uncompressed integers of the audio curve ... write some code to read a WAVE file, then parse the header (identify bit depth, sample rate, channel count, endianness) to enable you to iterate across each audio sample for each channel ... prove that it's working by sending your bytes into an output WAVE file ... diff the input WAVE against the output WAVE, as they should be identical ... once mastered you are ready to venture into your above stated goal
... do not skip over grokking the notion of interleaving stereo audio, nor the notion of spreading a single audio sample with a bit depth of 16 bits across two bytes of storage, and the reverse, namely stitching multiple bytes back together into a single integer with a bit depth of 16, 24 or even 32 bits while keeping endianness squared away ... this may sound scary at first, however all the necessary details are on the net, as that is how I taught myself this level of detail
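A sketch of that exercise in Python, assuming the simple canonical 44-byte header and 16-bit PCM (real-world files can carry extra chunks, so a robust parser would walk the chunk list instead):

    import struct

    def parse_wav(path):
        with open(path, "rb") as f:
            header = f.read(44)
            (riff, _, wave_id, _, _, audio_format, channels, sample_rate,
             _, _, bits_per_sample, data_id, data_size) = struct.unpack("<4sI4s4sIHHIIHH4sI", header)
            assert riff == b"RIFF" and wave_id == b"WAVE" and data_id == b"data"
            assert audio_format == 1 and bits_per_sample == 16   # uncompressed 16-bit PCM only
            payload = f.read(data_size)
        print(channels, "channel(s) at", sample_rate, "Hz,", bits_per_sample, "bits per sample")
        # stitch each little-endian byte pair back into one signed 16-bit integer
        samples = [int.from_bytes(payload[i:i + 2], "little", signed=True)
                   for i in range(0, len(payload), 2)]
        return header, samples

    # round-trip proof: rebuild the file from the parsed samples, then diff input vs. output
    header, samples = parse_wav("in.wav")
    with open("out.wav", "wb") as f:
        f.write(header)
        for s in samples:
            f.write(s.to_bytes(2, "little", signed=True))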
modern audio compression algorithms leverage knowledge of how people perceive sound to discard information which is indiscernible (lossy), as opposed to lossless algorithms, which retain all of the informational load of the source ... Opus (http://opus-codec.org/) is a current favorite codec, untainted by patents and open source

Related

What is the difference between a WAV file and an M4A file?

I'm looking to convert some audio files into spectrograms, and I'm wondering what the difference is between an M4A file and a WAV file. If I have two copies of the same audio recording, one saved as WAV and the other as M4A, will there be a difference in the spectrogram representations of the two?
Both WAV and M4A are container formats, with options for how exactly the audio data is encoded and represented inside the file. A WAV file has one audio track, with a variety of encoding options, including some of those possible in the M4A format. Most typically, though, WAV refers to a file with uncompressed audio inside, where the data is stored in PCM format.
M4A files are MP4 (MPEG-4 Part 14) files with the implication that there is one audio track inside. There are far fewer encoding options, even though they still include both compressed and uncompressed ones. Most often, M4A has audio encoded with AAC, which is a lossy encoding. Depending on how much information was lost during the encoding, your spectrogram could differ from one built on the original data.
The M4A format uses a lossy compression algorithm, so there may be differences, depending on the compression level and on the resolution and depth of the spectrogram. The .wav format can also be lossy, due to quantization of the sound by an A/D converter or by any sample format/rate conversions. So the difference may be in the noise floor, or in the portions of the sound's spectrum that are usually inaudible to humans (due to masking effects, etc.).
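To see the difference empirically, one could compute a spectrogram of each file and compare them. A rough Python sketch with librosa and scipy (assuming ffmpeg is installed so librosa can decode the M4A; file names are placeholders):

    import librosa
    import numpy as np
    from scipy import signal

    # decode both files to PCM floats at a common sample rate
    wav_pcm, sr = librosa.load("clip.wav", sr=44100, mono=True)
    m4a_pcm, _ = librosa.load("clip.m4a", sr=44100, mono=True)

    # note: AAC encoders add priming/padding samples, so exact alignment may need trimming
    n = min(len(wav_pcm), len(m4a_pcm))
    _, _, wav_spec = signal.spectrogram(wav_pcm[:n], fs=sr)
    _, _, m4a_spec = signal.spectrogram(m4a_pcm[:n], fs=sr)

    # crude measure: energy of the residual relative to the original
    diff_db = 10 * np.log10(np.abs(wav_spec - m4a_spec).mean() / wav_spec.mean())
    print(f"mean spectrogram difference: {diff_db:.1f} dB")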

How can I convert audio (.wav) to a satellite image?

I need to create software that can capture sound (from a NOAA satellite with an RTL-SDR). The problem is not capturing the sound; the problem is how to convert the audio or waves into an image. I have read about many things (the Fast Fourier Transform, the Hilbert transform, etc.) but I don't know how.
If you can give me an idea it would be fantastic. Thank you!
Over the past year I have been writing code which makes FFT calls and have amassed 15 pages of notes, so the topic is vast, however I can boil it down
Open up your WAV file ... parse the 44 byte header and note the given bit depth and endianness attributes ... then read across the payload, which is everything after that header ... understand the notions of bit depth and endianness ... typically a WAV file has a bit depth of 16 bits, so each point on the audio curve is stored across two bytes, and typically a WAV file is little endian, not big endian ... knowing what that means, you take the next two bytes, bit shift the high-order byte 8 bits to the left (the second byte, if little endian), then bit OR the pair of bytes into an integer, interpreting the 16 bit pattern as a signed two's complement value ... that signed integer typically varies from -2^15 to (2^15 - 1), so divide by 2^15 to convert it into its floating point equivalent, and your audio curve points now vary from -1 to +1 ... do that conversion for each set of bytes which corresponds to each sample of your payload buffer
Once you have the WAV audio curve as a buffer of floats, which is called raw audio or PCM audio, perform your FFT API call ... all languages have such libraries ... the output of the FFT call will be a set of complex numbers ... pay attention to the notion of the Nyquist limit, as it will influence how you make use of the output of your FFT call
Now you have a collection of complex numbers ... the indices from 0 to N of that collection correspond to frequency bins ... the size of your PCM buffer determines how granular your frequency bins are: the bin spacing is sample_rate / N, so the more samples in the PCM buffer you send to the FFT call, the finer the granularity of the output frequency bins ... essentially this means that as you walk across this collection of complex numbers, each index increments the frequency assigned to that index by that spacing
To visualize this, just feed it into a 2D plot where the X axis is frequency and the Y axis is magnitude ... calculate the magnitude of each complex number using

    curr_mag = 2.0 * math.sqrt(curr_real * curr_real + curr_imag * curr_imag) / number_of_samples
For simplicity we will sweep under the carpet the phase shift information available to you in your complex number buffer
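Putting the above steps together, a minimal Python sketch with numpy and matplotlib might look like this (it assumes a mono, 16-bit, little-endian PCM WAV; the file name is a placeholder):

    import wave

    import matplotlib.pyplot as plt
    import numpy as np

    with wave.open("input.wav", "rb") as w:      # assumes mono 16-bit little-endian PCM
        sample_rate = w.getframerate()
        raw = w.readframes(w.getnframes())

    # stitch little-endian byte pairs into signed ints, then scale into -1..+1
    pcm = np.frombuffer(raw, dtype="<i2").astype(np.float64) / 2**15

    spectrum = np.fft.rfft(pcm)                  # rfft returns only the bins up to the Nyquist limit
    freqs = np.fft.rfftfreq(len(pcm), d=1.0 / sample_rate)   # bin spacing = sample_rate / N
    mags = 2.0 * np.abs(spectrum) / len(pcm)     # same magnitude formula as above

    plt.plot(freqs, mags)
    plt.xlabel("frequency (Hz)")
    plt.ylabel("magnitude")
    plt.show()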
This only scratches the surface of what you need to master to properly render a WAV file into a 2D plot of its frequency domain representation ... there are libraries which perform parts or all of this, however now you can appreciate some of the magic involved when the rubber hits the road
A great explanation of the trade-offs between frequency resolution and the number of audio samples fed into your FFT call: https://electronics.stackexchange.com/questions/12407/what-is-the-relation-between-fft-length-and-frequency-resolution
Do yourself a favor and check out https://www.sonicvisualiser.org/, one of many audio workstations which can perform what I described above. Just hit menu File -> Open -> choose a local WAV file -> Layer -> Add Spectrogram, and it will render the visual representation of the Fourier transform of your input audio file.

How to compare / match two non-identical sound clips

I need to take short sound samples every 5 seconds, and then upload these to our cloud server.
I then need to find a way to compare / check if that sample is part of a full long audio file.
The samples will be recorded from a phones microphone, so they will indeed not be exact.
I know this topic can get quite technical and complex, but I am sure there must be some libraries or online services that can assist in this complex audio matching / pairing.
One idea was to use an audio-to-text conversion service and then do the matching based on the actual dialog. However this does not feel efficient to me, whereas matching based on actual sound frequencies or patterns would be a lot more efficient.
I know there are services out there, such as Shazam, that do this type of audio matching. However I would imagine their services are all proprietary.
Some factors that could influence it:
Both audio samples will be timestamped, so we do not have to search through the entire sound clip.
To get traction on an answer you need to focus on an answerable question where you have done battle and can show your code
Off the top of my head, I would walk across the audio and pluck out a bucket of several samples ... then slide forward by several samples and perform another bucket pluck operation ... allow each bucket to contain overlap samples also contained in the previous bucket as well as the next bucket ... fewer samples means quicker computation, more samples means greater accuracy, to an extent, YMMV
... feed each bucket into a Fourier transform to render the time domain input audio into its frequency domain counterpart ... record into a database the salient attributes of the FFT of each bucket, like the X frequencies having the most energy (greatest magnitude in your FFT output)
... also perhaps store the standard deviation of those top X frequencies with respect to their energy (how dispersed those frequencies are) ... define additional such attributes as needed ... for such a frequency domain approach to work you need relatively few samples in each bucket, since the FFT works on periodic time series data, so if you feed it 500 milliseconds of complex audio like speech or music you no longer have periodic audio, instead you have mush
Then once all existing audio has been sent through the above processing, do the same to your live new audio, then identify which prior audio contains the sequence of buckets most similar to your current audio input ... use a Bayesian approach so your guesses have probabilistic weights attached, which lend themselves to real-time updates
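A toy sketch of that bucket-and-peaks idea in Python with numpy (the window size, hop, and "top 5" peak count are arbitrary illustrative choices; a real system would hash the peaks rather than compare raw tuples):

    import numpy as np

    def fingerprint(pcm, sample_rate, window=2048, hop=1024, top=5):
        """Toy fingerprint: the `top` strongest frequency bins per overlapping bucket."""
        prints = []
        for start in range(0, len(pcm) - window, hop):
            bucket = pcm[start:start + window] * np.hanning(window)  # taper the bucket edges
            mags = np.abs(np.fft.rfft(bucket))
            peak_bins = np.argsort(mags)[-top:]              # indices of the strongest bins
            peak_freqs = peak_bins * sample_rate / window    # convert bin index to Hz
            prints.append(tuple(sorted(peak_freqs.round().astype(int))))
        return prints

    def match_score(full_prints, sample_prints):
        # crude score: how many of the sample's bucket fingerprints appear in the full clip
        known = set(full_prints)
        return sum(1 for p in sample_prints if p in known)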
Sounds like a very cool project, good luck ... here are some audio fingerprinting resources
does audio clip A appear in audio file B
Detecting audio inside audio [Audio Recognition]
Detecting a specific pattern from a FFT in Arduino
Audio Fingerprinting using the AudioContext API
https://news.ycombinator.com/item?id=21436414
https://iq.opengenus.org/audio-fingerprinting/
Chromaprint is the core component of the AcoustID project.
It's a client-side library that implements a custom algorithm for extracting fingerprints from any audio source
https://acoustid.org/chromaprint
Audio landmark fingerprinting as a Node Stream module - nodejs converts a PCM audio signal into a series of audio fingerprints.
https://github.com/adblockradio/stream-audio-fingerprint
SO followup
How to compare / match two non-identical sound clips
Audio fingerprinting and recognition in Python
https://github.com/worldveil/dejavu
Audio Fingerprinting with Python and Numpy
http://willdrevo.com/fingerprinting-and-audio-recognition-with-python/
MusicBrainz: an open music encyclopedia (musicbrainz.org)
https://news.ycombinator.com/item?id=14478515
How does Chromaprint work?
https://oxygene.sk/2011/01/how-does-chromaprint-work/
https://acoustid.org/
MusicBrainz is an open music encyclopedia that collects music metadata and makes it available to the public.
https://musicbrainz.org/
Audio Matching (Audio Fingerprinting)
Is it possible to compare two similar songs given their wav files?
audio hash
https://en.wikipedia.org/wiki/Hash_function#Finding_similar_records
audio fingerprint
https://encrypted.google.com/search?hl=en&pws=0&q=python+audio+fingerprinting
ACRCloud
https://www.acrcloud.com/
How to recognize a music sample using Python and Gracenote?

Why review compositing work in MJPEG videos rather than (say) H.264?

I have received a request to encode DPX files to MOV/MJPEG rather than MOV/H.264 (which ffmpeg picks by default if you convert to output.mov). This is to review compositing renders (in motion), so color accuracy is critical.
Comparing a sample "ideal" MOV to the current (H.264) output I can see:
resolution: the same
ColorSpace/Primaries: Rec601 (SD) versus Rec709 (HD)
YUV: 4:2:0 versus 4:4:4
filesize: smaller
The ffmpeg default seems to be better quality and result in a smaller filesize. Is there something I'm missing?
Maybe it's because MJPEG frames are independent of each other, so any snippet of video can be decoded / copied in isolation. With an inter-frame compression algorithm like H.264, the software has to scan data for potentially numerous frames to reconstruct any given one.
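For what it's worth, if the MOV/MJPEG deliverable is required anyway, something along these lines should produce it (the quality level and the 4:4:4 pixel format here are illustrative guesses aimed at review-quality color, not a vetted spec for your pipeline; the input pattern is a placeholder):

    ffmpeg -i render_%04d.dpx -c:v mjpeg -q:v 2 -pix_fmt yuvj444p output.mov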

How to know the bit depth of a mp3 file?

An MP3 file header only contains the sample rate and the bit rate, so the decoder can't figure out the bit depth from the header. Maybe it can only guess from the bit rate? But the bit rate varies from frame to frame.
Here is another way to ask this question: if I encode a 24-bit WAV to MP3, how is the 24-bit info stored in this MP3?
When the source WAV is compressed, the original bit depth information is "thrown away". This is by design in any lossy audio codec, since the whole point is to use the fewest bits possible to store the "same" audio.
Internally, MP3 uses Huffman symbols to store the processed audio data. As such, there's no real "bit depth" to report.
During the encoding process, the samples are quantized, so the original bit depth information is lost.
MP3 decoders either choose a bit depth they operate at, or allow the end user/application to dictate it. The bit depth is determined during "re-quantization".
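In other words, the decoder reconstructs each sample as a continuous value and then renders it at whatever bit depth is requested. A hypothetical sketch of that final step in Python (the sample value is made up for illustration):

    # "re-quantize" a decoded sample (a float in [-1.0, 1.0]) to a chosen bit depth
    def requantize(x, bits):
        scale = 2 ** (bits - 1) - 1    # e.g. 32767 for 16-bit, 8388607 for 24-bit
        return round(x * scale)

    x = 0.12345                        # made-up decoded sample value
    print(requantize(x, 16))           # -> 4045
    print(requantize(x, 24))           # -> 1035574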
Have a read of http://blog.bjrn.se/2008/10/lets-build-mp3-decoder.html, which is rather enlightening.