What integer type is used for MP3 data frames?

I am writing a universal parser library for various binary formats in Rust as part of a personal project, and I've started researching the file structure of MP3 files. As I understand it, an MP3 file consists of header and data frames. Each header frame provides metadata about the data frame that follows it. Here is a diagram and a listing of allowed values for MP3 header frames that I am referencing.
I understand the format of the MP3 header. My confusion surrounds the MP3 data frames: I can't find a source that specifies what integer type samples are encoded as in the data frame portion of an MP3 file. Are they 8-bit, 16-bit, 32-bit, signed, unsigned, etc.?
The best I can think of is to use a combination of the sample rate and bitrate to calculate what each sample's size should be. However, that still wouldn't tell me whether each sample is a signed or unsigned integer.
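To illustrate the kind of header arithmetic I mean, here is a rough Rust sketch of the well-known frame-length formula for MPEG-1 Layer III (the lookup tables are abbreviated, the version and layer bits are not checked, and the function name is my own):

fn frame_length(header: u32) -> Option<u32> {
    // The 32-bit header must start with the 11-bit sync word (all ones).
    if header >> 21 != 0x7FF {
        return None;
    }
    // MPEG-1 Layer III bitrate table in kbps, indexed by header bits 15..12
    // (index 0 = "free" bitrate, index 15 = invalid).
    const BITRATES_KBPS: [u32; 16] =
        [0, 32, 40, 48, 56, 64, 80, 96, 112, 128, 160, 192, 224, 256, 320, 0];
    // MPEG-1 sample-rate table in Hz, indexed by header bits 11..10.
    const SAMPLE_RATES: [u32; 4] = [44_100, 48_000, 32_000, 0];

    let bitrate = BITRATES_KBPS[((header >> 12) & 0xF) as usize] * 1_000;
    let sample_rate = SAMPLE_RATES[((header >> 10) & 0x3) as usize];
    let padding = (header >> 9) & 0x1;
    if bitrate == 0 || sample_rate == 0 {
        return None; // free-format or invalid header
    }
    // Layer III frame length in bytes, including the 4-byte header.
    Some(144 * bitrate / sample_rate + padding)
}

But this only gives the size of each frame so a parser can skip from header to header; as far as I can tell, it says nothing about the integer type of the samples inside.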
I'm not trying to decode these files, just parse them. I've had a surprisingly hard time finding this information. Any information or help someone can offer would be much appreciated.

Although this is not related to .mp3 per se, there could potentially be some helpful information in Will C. Pirkle's book, Designing Audio Effect Plugins in C++.
He discusses the way the .wav audio format stores its information. It uses signed integers ranging from -32,768 to 32,767. This is a range of 2^16 values in a bipolar format, where the exponent corresponds to the bit depth (most commonly 16 or 24).
Another important thing to note is that while phase inversion is common in many audio applications, there is no corresponding integer for inverting -32,768, since the positive range only extends to 32,767. To compensate, it's common to treat the value -32,768 as -32,767. This only matters, though, if you are using the value 0 in your processing, which is most often the case. Otherwise, one could extend the upper limit to 32,768.
He does state that it's more common for audio processing applications to deal with floating-point numbers, either between 0.0f and 1.0f or between -1.0f and 1.0f. The reason is that addition and multiplication are common operations in DSP, and these ranges help avoid overflow. In the bipolar integer format, it's too easy to find two numbers whose product or sum falls outside the range. In the range -1.0f to 1.0f, the product of any two numbers always stays within that range. Unfortunately, addition still requires caution.
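For instance, mapping 16-bit integer samples into that bipolar float range is a one-liner. A minimal Rust sketch, dividing by 32,768 so the most negative sample lands exactly on -1.0:

// Map a signed 16-bit PCM sample into the bipolar range [-1.0, 1.0).
// Dividing by 32_768 puts -32_768 exactly at -1.0; the maximum value
// 32_767 lands just below 1.0.
fn pcm16_to_float(sample: i16) -> f32 {
    sample as f32 / 32_768.0
}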
I'm sorry I don't have more information about .mp3s specifically, but perhaps this could still be insightful.
Good luck!

Related

Seeking on Ogg/Opus

I have ogg-opus audio files, each containing a single mono track at a fixed sample rate (16 kHz). I'm trying to implement seeking on them for streaming. For example, I want to know byte offsets to partially download a file (with HTTP Range) and play only the first 10 seconds, or say from second 10 to second 15. That is, I need to get the byte offset for any given time position.
Is there a way to do it without loading/decoding an entire file in this case?
I don't believe there's a way to determine the exact byte offset for a specific time, but libopusfile's op_pcm_seek() could be used for seeking once you have the bytes. Given the varying bitrates, page sizes, and packet durations of Opus files, some guesswork and dynamic calculation seem to be required. I'm attempting to do the same thing, and a few people have asked me to implement it in OpusStreamDecoder. You could look at its underlying opus_chunkdecoder.c and the specific feature request, which outlines how this could be achieved:
https://github.com/AnthumChris/opus-stream-decoder/issues/1
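For what it's worth, the rough approach I'm experimenting with is: estimate a starting byte offset by linear interpolation over the file, fetch a small range there, and scan for the next Ogg page boundary (every page begins with the "OggS" capture pattern) before handing bytes to the decoder. A minimal Rust sketch of those two helpers (the names are mine, not from any library):

// Guess a byte offset for a target time, assuming a roughly constant
// bitrate across the whole file. This is only a first guess; it must
// be refined using the granule positions of the pages found nearby.
fn estimate_offset(target_secs: f64, total_secs: f64, file_len: u64) -> u64 {
    ((target_secs / total_secs) * file_len as f64) as u64
}

// Find the next Ogg page boundary in a downloaded chunk by scanning
// for the "OggS" capture pattern that starts every page.
fn next_page_boundary(chunk: &[u8]) -> Option<usize> {
    chunk.windows(4).position(|w| w == b"OggS")
}

Once you've found a page, its granule position (always in 48 kHz ticks for Opus, regardless of the input sample rate) tells you how far off the guess was, and you can repeat with a corrected estimate.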

MPEG-DASH trick modes

Does anyone know how to do trick modes (rewind/fast-forward at different speeds) with MPEG-DASH?
DASH-IF Interoperability Points V3.0 states that it is possible. The general idea is laid out in the document, but the details are not specified.
A DASH segmenter should add tracks with a frame rate lower than normal to a specially marked AdaptationSet. Roughly speaking (even though in theory you should look at the exact profile/level thresholds), half the frame rate permits double the playout rate, and a quarter of the frame rate permits quadruple the playout rate.
All this is only an offer to the DASH client to facilitate fast-forward. The client can use it but doesn't have to. If the DASH client doesn't understand the AdaptationSet at all, it will disregard it due to the EssentialProperty that marks it as a trick-play AdaptationSet.
I can't see how fast rewind can be supported in any spec-conforming way. You'd need to implement it according to your needs, but with no expectation of interoperability.
You can find an indication in ISO/IEC 23009-1:2014(E), Annex A:
The client may pause or stop a Media Presentation. In this case, the client simply stops requesting Media Segments or parts thereof. To resume, the client sends requests for Media Segments, starting with the next Subsegment after the last requested Subsegment.
If a specific Representation or SubRepresentation element includes the @maxPlayoutRate attribute, then the corresponding Representation or Sub-Representation may be used for the fast-forward trick mode. The client may play the Representation or Sub-Representation with any speed up to the regular speed times the specified @maxPlayoutRate attribute, with the same decoder profile and level requirements as the normal playout rate. If a specific Representation or SubRepresentation element includes the @codingDependency attribute with value set to 'false', then the corresponding Representation or Sub-Representation may be used for both fast-forward and fast-rewind trick modes.
Sub-Representations in combination with Index Segments and Subsegment Index boxes may be used for efficient trick-mode implementation. Given a Sub-Representation with the desired @maxPlayoutRate, ranges corresponding to SubRepresentation@level and all level values from SubRepresentation@dependencyLevel may be extracted via byte ranges constructed from the information in the Subsegment Index box. These ranges can be used to construct more compact HTTP GET requests.
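As an illustration of that last point, here is a small Rust sketch of turning a set of extracted byte ranges into a single multi-range HTTP Range header (the ranges themselves would come from parsing the Subsegment Index box, which is not shown; the function name is mine):

// Build an HTTP Range header value such as "bytes=0-499,1200-1799"
// from inclusive byte ranges, per RFC 7233 multi-range syntax.
fn range_header(ranges: &[(u64, u64)]) -> String {
    let list: Vec<String> = ranges
        .iter()
        .map(|(start, end)| format!("{}-{}", start, end))
        .collect();
    format!("bytes={}", list.join(","))
}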

Determining the 'amount' of speaking in a video

I'm working on a project to transcribe lecture videos. We are currently just using humans to do the transcriptions, as we believe it is easier to transcribe from scratch than to edit ASR output, especially for technical subjects (not the point of my question, though I'd love any input on this). From our experience, we've found that after about 10 minutes of transcribing we get anxious or lose focus. Thus we have been splitting videos into ~5-7 minute chunks based on logical breaks in the lecture content. However, we've found that the start of a lecture (at least for the class we are piloting) often has more talking than later portions, which often include time where the students are talking among themselves about a question. I was thinking that we could use signal processing to determine the rough amount of speaking throughout the video. The idea is to break the video into segments containing roughly the same amount of lecturing, as opposed to segments of the same length.
I've done a little research into this, but everything seems to be overkill for what I'm trying to do. The videos for this course, though we'd like to generalize, contain basically just the lecturer, with occasional feedback and distant student voices. So can I simply look at the waveform and use the spots where the audio exceeds some threshold to determine when the lecturer is speaking? Or is an ML approach really necessary to quantify the lecturer's speaking?
Hope that made sense, I can clarify anything if necessary.
Appreciate the help as I have no experience with signal processing.
Although there are machine learning methods that are very good at discriminating voice from other sounds, you don't seem to require that sort of accuracy for your application. A simple level-based method, similar to the one you proposed, should be good enough to estimate speaking time.
Level-Based Sound Detection
Goal
Given an audio sample, discriminate the portions with a high amount of sounds from the portions that consist of background noise. This can then be easily used to estimate the amount of speech in a sound file.
Overview of Method
Rather than looking at raw levels in the signal, we will first convert it to a sliding-window RMS. This gives a simple measure of how much audio energy is present at any given point in the audio sample. By analyzing the RMS signal, we can automatically determine a threshold for distinguishing between background noise and speech.
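(For reference, the RMS over a window of N samples x_1, ..., x_N is sqrt((x_1^2 + x_2^2 + ... + x_N^2) / N): the square root of the average of the squared samples.)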
Worked Example
I will be working this example in MATLAB because it makes the math easy to do and lets me create illustrations.
Source Audio
I am using President Kennedy's "We choose to go to the moon" speech. I'm using the audio file from Wikipedia, and just extracting the left channel.
% Load the audio and keep only the left channel.
imported = importdata('moon.ogg');
audio = imported.data(:,1);
% Plot the raw waveform against time in seconds.
plot((1:length(audio))/imported.fs, audio);
title('Raw Audio Signal');
xlabel('Time (s)');
Generating RMS Signal
Although you could technically implement an overlapping, per-sample sliding window, it is simpler to avoid the overlap, and you'll get very similar results. I broke the signal into one-second chunks and stored the RMS values in a new array, with one entry per second of audio.
% Compute one RMS value per non-overlapping one-second window.
audioRMS = [];
for i = 1:imported.fs:(length(audio)-imported.fs)
    audioRMS = [audioRMS; rms(audio(i:(i+imported.fs-1)))];
end
% Plot the per-second RMS envelope.
plot(1:length(audioRMS), audioRMS);
title('Audio RMS Signal');
xlabel('Time (s)');
This results in a much smaller array, full of positive values representing the amount of audio energy or "loudness" per second.
Picking a Threshold
The next step is to determine how "loud" is "loud enough." You can get an idea of the distribution of noise levels with a histogram:
histogram(audioRMS, 50);
I suspect that the lower shelf is the general background noise of the crowd and recording environment. The next shelf is probably the quieter applause. The rest is speech and loud crowd reactions, which will be indistinguishable to this method. For your application, the loudest areas will almost always be speech.
The minimum value in my RMS signal is 0.0233, and as a rough guess I'm going to use 3 times that value as my criterion for noise. That seems like it will cut off the whole lower shelf and most of the next one.
A simple check against that threshold gives a count of 972 seconds of speech:
>> sum(audioRMS > 3*min(audioRMS))
ans =
972
To test how well it actually worked, we can listen to the audio that was eliminated.
% One logical value per second: true where the RMS exceeds the threshold.
speech = audioRMS > 3*min(audioRMS);
% Concatenate all the one-second chunks classified as non-speech.
clippedAudio = [];
for i = 1:length(speech)
    if(~speech(i))
        clippedAudio = [clippedAudio; audio(((i-1)*imported.fs+1):i*imported.fs)];
    end
end
>> sound(clippedAudio, imported.fs);
Listening to this gives a bit over a minute of background crowd noise and sub-second clips of portions of words, an artifact of the one-second windows used in the analysis. No significant lengths of speech are clipped. Doing the opposite gives audio that is mostly speech, with clicks heard where portions are skipped. The louder applause breaks also make it through.
This means that for this speech, the threshold of three times the minimum RMS worked very well. You'll probably need to fiddle with that ratio to get good automatic results for your recording environment, but it seems like a good place to start.

Using "seed" based math to recreate application instances

Okay, so I was thinking today about Minecraft, a game which so many of you are familiar with, I'm sure. While my question isn't directly related to the game, I find it much simpler to describe my question using the game as an example.
My question is: is there any way a type of "seed," or string of characters, can be used to recreate an instance of a program (not in the literal programming sense)? That is, by storing a code which, when re-entered into the program as a string at run-time, could recreate the data it once held in fields, text boxes, canvases, and so on, exactly as it was.
As I understand it, Minecraft takes the string of ASCII characters you enter (all of which are ultimately numbers) and performs a series of operations on it that evaluate to some finite hash or number. This number (again, as I understand it) is the representation of the string you entered. It makes sense that a given string, when parsed by this algorithm, will always evaluate to the same hash: 1 + 1 will always equal 2, so a seed's value must always evaluate to the same result in the end. In doing so, you have the ability to replicate worlds exactly by entering this sort of key, which is evaluated the same way on every machine.
Now, if we can exactly replicate worlds like this, is it possible to bring the idea into a more abstract context, like the following?
Say you have an application like Microsoft Word. Word saves the data you have entered as a file on your hard drive: the formatting data, the strings you've entered, the format of the file, all of that in a physical file. Now imagine that when you finished your essay in Word, instead of saving it and bringing your laptop to school, you clicked "parse," and instead of creating a file you were given a hash code. At school, you know you have to print it, so you log onto a computer and open Word. Instead of "open" there is now an option called "evaluate"; you click it, enter the hash your other computer produced, and it recreates the exact essay you wrote.
Is this possible? If so, are there obvious implementations of this that I'm simply not thinking of, or that are so seemingly part of everyday life that I don't recognize them? And finally, if it is possible, what methods and algorithms would go into such a thing?
[EDIT]
I had to do some research on the anatomy of a seed, and I think this explains it well:
"The limit is 32 characters or, for a numeric seed, 19 digits plus the minus sign. Numeric seeds can range from -9223372036854775808 to 9223372036854775807, which is a total of 18446744073709551616 possible values. Text strings entered will be 'hashed' to one of the numeric seeds in the above range. The 'Seed for the World Generator' window only allows 32 characters to be entered and will not show or use any more than that."
BUT looking back on it, lossless compression IS EXACTLY what I was describing. After re-reading the wiki page and remembering that (you are very correct) the seed only partakes in the generation, I see that the final data is stored as a "physical" file on the HDD, which (again, you are correct) is raw, uncompressed data.
So in retrospect, I believe I was describing lossless compression, trying in my mind to figure out how the seed was able to replicate the exact same world, forgetting that the seed is only responsible for generating the world, not for saving or compressing it.
So thank you for your help, guys! It's really appreciated. I believe we can call this one solved!
There are several possibilities for achieving this "string" that recovers your data. However, they're not all applicable in every context.
An actual seed, which initializes, for example, a pseudo-random number generator, then allows recreating the same sequence of pseudo-random numbers (see this question; a concrete sketch follows at the end of this answer).
This is possibly similar to what Minecraft relies on, because the whole process of how to create a world based on some choices (possibly pseudo-random choices) is known in advance. Even if we pretend that we have random numbers, computers are actually deterministic, which makes this possible.
If your document were generated randomly, then this would be applicable: with the same seed, the same gibberish comes out.
Some key-value dictionary, or hash map. Then the values have to be accessible by both sides, and the string is the key that allows retrieving the value.
Think, for example, of storing your Word file on an online server; then your key is the URL linking to your file.
Compressing all the information in your data into the string. This is much harder, and there are strong limits due to the entropy of the data; see Shannon's source coding theorem, for example.
You would be better off (as in, it would be easier) just compressing your file with a usual algorithm (zip or 7z or something else) rather than reimplementing compression yourself, especially once your document starts having fancy things (different styles, tables, pictures, unusual characters...).
With the simple hypothesis of 27 possible characters (26 letters and the space), Shannon himself showed in Prediction and Entropy of Printed English (Bell System Technical Journal, 30:1, January 1951, pp. 50-64, online version) that there are about 2.14 bits of entropy per letter in English. That's about 550 characters encoded with your 32-character string.
While this is significantly better than the 8 bits we use for each ASCII character, it also shows it is very likely impossible to encode an English document in less than about a quarter of its size. And then you'd still have to add punctuation and all the rest.
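To make the first option concrete, here is a small self-contained Rust sketch of the "same seed, same sequence" behavior, using a toy linear congruential generator (the constants are Knuth's MMIX parameters; the struct and names are just for illustration):

// A toy linear congruential generator. Real programs would use a
// proper PRNG library, but the principle is identical: the entire
// sequence is a pure function of the starting seed.
struct Lcg(u64);

impl Lcg {
    fn next(&mut self) -> u64 {
        // Knuth's MMIX LCG constants.
        self.0 = self.0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.0
    }
}

fn main() {
    // Two generators seeded identically produce identical output on
    // any machine; this determinism is all a "seed" buys you.
    let (mut a, mut b) = (Lcg(42), Lcg(42));
    for _ in 0..5 {
        assert_eq!(a.next(), b.next());
    }
    println!("same seed, same sequence");
}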

How do computers process ASCII/text & images/color differently?

I've recently been thinking more about what kind of work computer hardware has to do to produce the things we expect.
Comparing text and color, it seems that both rely on combinations of 1s and 0s, with 256 possible combinations per byte. ASCII may interpret a byte such as 01100001 as the letter 'a'. But the same pattern could appear as a color: R(01100001), G(01100001), B(01100001), representing some shade. Considering that at a low level the computer is just reading these collections of 1s and 0s, what needs to happen to ensure the computer renders the color R(01100001), G(01100001), B(01100001) and not the letter 'a' three times on my screen?
I'm not entirely sure this question is appropriate for Stack Overflow, but I'll go ahead and give a basic answer anyway. It's actually a very complicated question, because depending on how deep you want to go in answering it, one could write an entire book on computer architecture.
So to keep it simple, I'll just give you this: it's all a matter of context. First, let's just tackle text:
When you open, say, a text editor, the implicit assumption is that the data to be displayed is textual in nature. The text to be displayed is some bytes in memory (possibly copied out of some bytes on disk). There's no magical internal context, from the memory's point of view, marking those bytes as text. Instead, the source for the text editor contains some commands that point to those bytes and say "these bytes represent 300 characters of text," for example. Then there's a complex sequence of steps, involving library code all the way down to hardware, that handles mapping those bytes to characters according to an encoding like ASCII (there are many other ways of encoding text), finding those characters in a font, writing that font to the screen, and so on.
The point is that nothing forces those bytes to be interpreted as text. The editor does so because that's what a text editor does. You could hypothetically open the same bytes in an image program and tell it to interpret those 300 bytes as a 10x10 array (or image) of RGB values.
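Here's a tiny Rust sketch of that idea: the same bytes read either as ASCII text or as one RGB pixel (0x61 is the byte pattern 01100001 from the question):

fn main() {
    // Three identical bytes, with no inherent meaning of their own.
    let bytes: [u8; 3] = [0x61, 0x61, 0x61];

    // Interpreted as ASCII: 0x61 = 97 = 'a', three times.
    let text: String = bytes.iter().map(|&b| b as char).collect();
    println!("as text:  {}", text); // prints "aaa"

    // Interpreted as one RGB pixel: a medium gray (97, 97, 97).
    let (r, g, b) = (bytes[0], bytes[1], bytes[2]);
    println!("as pixel: rgb({}, {}, {})", r, g, b);
}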
As for colors, the same logic applies: they're just bytes in memory. When the code that's drawing something to the screen has decided what pixels it wants to write with what colors, it pipes those bytes via a memory mapping to the video card, which then translates them into commands sent to the monitor (still in some binary format representing pixels and colors, though the reality is a lot more complicated). The monitor itself contains firmware that handles the details of mapping those colors to the physical pixels, and the numbers that represent the colors are at some point converted to a specific current on each R/G/B channel to raise or lower its intensity.
That's all I have time for right now, but it's a start.
Update: Just to illustrate my point, I took the text of Flatland from here, which is just 216,624 bytes of ASCII text. (It's interpreted as such by your web browser based on context: the .txt extension helps, but the web server also provides a MIME type header informing the browser that it should be interpreted as plain text. Your browser might also analyze the bytes to determine that their pattern looks like plain text, and that there isn't an overwhelming number of bytes that don't represent ASCII characters.) I appended a few spaces to the end of the text so that its length is 217,083, which is 269 * 269 * 3, and then plotted it as a 269 x 269 RGB image:
Not terribly interesting-looking. But the point is that I took those exact same bytes and told the software, "okay, these are RGB values now." That's not to say that looking at plain-text bytes as images can't be useful. For example, it can be a useful way to visualize an encryption algorithm. This shows an image that was encrypted with a pretty insecure algorithm: you can still get a very good sense of the patterns of bytes in the original unencrypted file. If it were text and not an image, this would be no different, as text in a specific language like English also has known statistical patterns. A good encryption algorithm would make the encrypted image look more like random noise.
Zero and one are just zero and one, nothing more. A byte is just a collection of 8 bits.
The meaning you assign to a piece of information depends on what you need at the moment, on what "language" you use to interpret it. 65 is either the letter 'A' in ASCII or the number 65 if you're using it in, say, int a = 65 + 3.
At a low level, thousands of different machine instructions are executed to ensure that your data is treated properly, depending, for example, on the type of file you're reading, its headers, which process requests the data, and so on. The different high-level functions you use to handle different kinds of information expand to very different machine code.
