Seeking on Ogg/Opus

I have Ogg Opus audio files, each containing a single mono track at a fixed sample rate (16 kHz). I'm trying to implement seeking on them for streaming: for example, I want to know the byte offsets needed to partially download a file (with an HTTP Range request) and play only the first 10 seconds, or only second 10 through second 15. That is, I need to get the byte offset for any given time position.
Is there a way to do this without downloading and decoding the entire file?

I don't believe there's a way to determine the exact byte offset for a specific time, but libopusfile's op_pcm_seek() can be used for decoding once you have the bytes. Given the varying bit rates, page sizes, and packet durations of Opus files, some guesswork and dynamic calculation seem to be required. I'm attempting to do the same thing, and a few people have asked me to implement it in OpusStreamDecoder. You could look at its underlying opus_chunkdecoder.c and the specific feature request that outlines how this could be achieved:
https://github.com/AnthumChris/opus-stream-decoder/issues/1
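For what it's worth, you can build a coarse time-to-byte index without decoding any audio by scanning only the Ogg page headers: every page header carries a 64-bit granule position, and for Ogg Opus that position always counts 48 kHz samples regardless of the 16 kHz input rate. A minimal sketch in Python (function and file names are mine, and it ignores the Opus pre-skip):

```python
# Sketch: build a (time, byte offset) index by scanning Ogg page headers.
import struct

def ogg_page_index(path):
    """Return a list of (time_seconds, byte_offset) pairs, one per page."""
    index = []
    with open(path, "rb") as f:
        offset = 0
        while True:
            header = f.read(27)                 # fixed-size part of the page header
            if len(header) < 27 or header[:4] != b"OggS":
                break
            granule = struct.unpack_from("<q", header, 6)[0]
            n_segments = header[26]
            body_len = sum(f.read(n_segments))  # lacing values give the body size
            if granule >= 0:                    # -1 means no packet ends on this page
                index.append((granule / 48000.0, offset))
            offset += 27 + n_segments + body_len
            f.seek(offset)
    return index

# Usage sketch: byte offset of the last page ending at or before t = 10 s;
# start the HTTP Range request there. ("speech.opus" is a hypothetical file.)
# idx = ogg_page_index("speech.opus")
# start = max((off for t, off in idx if t <= 10.0), default=0)
```

The same scan works over HTTP if you fetch the file in chunks, and because a page's granule position marks the end of its last packet, starting at the preceding page boundary and discarding a few decoded samples gets you close to sample-accurate playback.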

Related

hls ext-x-discontinuity-sequence for endless stream

In an endless HLS stream, I'm not sure how to implement the EXT-X-DISCONTINUITY-SEQUENCE tag.
The RFC states:
If the server removes an EXT-X-DISCONTINUITY tag from the Media Playlist, it MUST increment the value of the EXT-X-DISCONTINUITY-SEQUENCE tag so that the Discontinuity Sequence Numbers of the segments still in the Media Playlist remain unchanged. The value of the EXT-X-DISCONTINUITY-SEQUENCE tag MUST NOT decrease or wrap. Clients can malfunction if each Media Segment does not have a consistent Discontinuity Sequence Number.
The media playlist I create always has the same number of segments; the oldest segment is deleted when a new one is added. Sometimes there is a discontinuity between two segments, so I add an EXT-X-DISCONTINUITY tag to the segment. However, after some time, when there are no more discontinuities in the playlist, I remove this tag and must increment EXT-X-DISCONTINUITY-SEQUENCE.
Since the stream is endless, it will have to wrap at some point. How do people usually implement this?
The value of EXT-X-DISCONTINUITY-SEQUENCE is defined as a decimal-integer, which is a number in the range 0 to 2^64 - 1 (see https://datatracker.ietf.org/doc/html/draft-pantos-hls-rfc8216bis-09#section-4.2).
Even if you were incrementing EXT-X-DISCONTINUITY-SEQUENCE many times a second (which the question implies you are not), it seems highly unlikely you would ever need to handle wrapping of this value.
Given the possible range and relatively slow incrementing in the general case, I seriously doubt anyone worries about wrapping this value, but I'd be interested to be proved wrong.
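For a sense of scale: even incrementing once per second, a 64-bit counter takes roughly 5.8 × 10^11 years to wrap, so nobody handles wrapping in practice. A minimal sketch (Python, hypothetical names) of the bookkeeping the quoted RFC text requires for a sliding-window playlist:

```python
# Sketch of a sliding-window live playlist writer. Window size, tag set and
# segment naming are illustrative, not prescribed by the HLS spec.
from collections import deque

class LivePlaylist:
    def __init__(self, window=5):
        self.window = window
        self.segments = deque()   # entries: (uri, duration, has_discontinuity)
        self.media_sequence = 0
        self.discontinuity_sequence = 0

    def add(self, uri, duration, discontinuity=False):
        self.segments.append((uri, duration, discontinuity))
        while len(self.segments) > self.window:
            _, _, had_disc = self.segments.popleft()
            self.media_sequence += 1
            if had_disc:
                # An EXT-X-DISCONTINUITY tag just left the playlist, so the
                # discontinuity sequence MUST be incremented (never decreased).
                self.discontinuity_sequence += 1

    def render(self):
        lines = ["#EXTM3U",
                 "#EXT-X-VERSION:3",
                 "#EXT-X-TARGETDURATION:6",
                 f"#EXT-X-MEDIA-SEQUENCE:{self.media_sequence}",
                 f"#EXT-X-DISCONTINUITY-SEQUENCE:{self.discontinuity_sequence}"]
        for uri, duration, disc in self.segments:
            if disc:
                lines.append("#EXT-X-DISCONTINUITY")
            lines.append(f"#EXTINF:{duration:.3f},")
            lines.append(uri)
        return "\n".join(lines)
```

The key point is that the counter only moves when a segment carrying a discontinuity tag slides out of the window, which in most streams happens far too rarely for wrapping to matter.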

What integer type is used for MP3 data frames?

I am writing a universal parser library for various binary formats in Rust as part of a personal project. I've started researching the file structure of MP3 files. As I understand it, an MP3 file consists of header and data frames, where each header provides meta information about the data frame that follows. Here is a diagram and a listing of allowed values for MP3 frame headers that I am referencing.
I understand the format of the MP3 header. My confusion surrounds the MP3 data frames: I can't seem to find a source that specifies what integer type the samples are encoded as in the data portion of an MP3 file. Are they 8-bit, 16-bit, 32-bit, signed, unsigned, etc.?
The best I can think of is to use a combination of the sample rate and bitrate to calculate what each sample's size should be. However, that doesn't determine whether each sample is a signed or unsigned integer.
I'm not trying to decode these files, I'm just trying to parse them. I've had a surprisingly hard time finding this information. Any information or help someone can offer would be much appreciated.
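If it helps with the parsing side: the payload that follows each header is Huffman-coded frequency-domain data (plus side information), not raw integer PCM, which is why no integer sample type is documented for it. A sketch of parsing just the 4-byte frame header, assuming MPEG-1 Layer III and abbreviated tables (Python; names are mine):

```python
# Sketch: decode the 4-byte MPEG-1 Layer III frame header described above.
import struct

# Index 0 means "free" bitrate, index 15 is invalid; index 3 sample rate is reserved.
BITRATES_KBPS = [None, 32, 40, 48, 56, 64, 80, 96,
                 112, 128, 160, 192, 224, 256, 320, None]
SAMPLE_RATES = [44100, 48000, 32000, None]  # Hz

def parse_frame_header(data: bytes) -> dict:
    (word,) = struct.unpack(">I", data[:4])
    if word >> 21 != 0x7FF:                       # 11-bit frame sync
        raise ValueError("no frame sync")
    bitrate = BITRATES_KBPS[(word >> 12) & 0xF]
    sample_rate = SAMPLE_RATES[(word >> 10) & 0x3]
    if bitrate is None or sample_rate is None:
        raise ValueError("free/invalid bitrate or reserved sample rate")
    padding = (word >> 9) & 0x1
    # Layer III frame length in bytes; lets a parser skip to the next header.
    frame_length = 144 * bitrate * 1000 // sample_rate + padding
    return {"bitrate_kbps": bitrate, "sample_rate": sample_rate,
            "frame_length": frame_length}
```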
Although this is not related to .mp3 per se, there could potentially be some helpful information in Will C. Pirkle's book, Designing Audio Effect Plugins in C++.
He discusses the way the .wav audio format stores its information: as signed integers from -32,768 to 32,767. That is a range of 2^16 values in a bipolar format, where the exponent corresponds to the bit depth (most commonly 16 or 24).
Another important thing to note is that while phase inversion is common in many audio applications, there is no in-range integer for inverting -32,768. To compensate, it's common to treat the value -32,768 as -32,767. This only matters, though, if you are using the value 0 as the center of your processing, which is most often the case. Otherwise, one could extend the upper limit to 32,768.
He does state that it's more common for audio processing applications to deal with floating-point numbers, either between 0.0f and 1.0f or between -1.0f and 1.0f. The reason is that addition and multiplication are common operations in DSP, and these ranges make overflow easier to avoid: in the bipolar integer format it's too easy to find two numbers whose product or sum falls outside the range, whereas in -1.0f to 1.0f the product of any two numbers always stays within the range. Unfortunately, addition still requires caution.
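As a small illustration of the integer/float conventions described above (a sketch assuming NumPy and 16-bit PCM; dividing by 32768 is one common choice, not the only one):

```python
# Convert between 16-bit signed PCM and the [-1.0, 1.0] float range.
import numpy as np

def pcm16_to_float(samples: np.ndarray) -> np.ndarray:
    return samples.astype(np.float32) / 32768.0

def float_to_pcm16(samples: np.ndarray) -> np.ndarray:
    # Clip first so sums/products that escaped [-1.0, 1.0] can't wrap around.
    clipped = np.clip(samples, -1.0, 32767.0 / 32768.0)
    return np.round(clipped * 32768.0).astype(np.int16)
```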
I'm sorry I don't have more information about .mp3s specifically, but perhaps this could still be insightful.
Good luck!

MPEG-DASH trick modes

Does anyone know how to do trick modes (rewind/fast-forward at different speeds) with MPEG-DASH?
DASH-IF Interoperability Points V3.0 states that it is possible: the general idea is laid out in the document, but the details are not specified.
A DASH segmenter should add tracks with a lower-than-normal frame rate to a specially marked AdaptationSet. Roughly speaking (even though in theory you should look at the exact profile/level thresholds), half the frame rate supports double the playout rate, and a quarter of the frame rate supports quadruple the playout rate.
All of this is only an offer to the DASH client to facilitate fast-forward; the client can use it but doesn't have to. A DASH client that doesn't understand the AdaptationSet will disregard it entirely, thanks to the EssentialProperty that marks it as a trick-play AdaptationSet.
I can't see how fast rewind could be supported in any spec-conforming way. You'd need to implement it according to your own needs, with no expectation of interoperability.
You can find an indication in ISO/IEC 23009-1:2014(E), Annex A:
The client may pause or stop a Media Presentation. In this case, the client simply stops requesting Media Segments or parts thereof. To resume, the client sends requests for Media Segments, starting with the next Subsegment after the last requested Subsegment.
If a specific Representation or SubRepresentation element includes the @maxPlayoutRate attribute, then the corresponding Representation or Sub-Representation may be used for the fast-forward trick mode. The client may play the Representation or Sub-Representation with any speed up to the regular speed times the specified @maxPlayoutRate attribute, with the same decoder profile and level requirements as the normal playout rate. If a specific Representation or SubRepresentation element includes the @codingDependency attribute with value set to 'false', then the corresponding Representation or Sub-Representation may be used for both fast-forward and fast-rewind trick modes.
Sub-Representations in combination with Index Segments and Subsegment Index boxes may be used for efficient trick-mode implementation. Given a Sub-Representation with the desired @maxPlayoutRate, the ranges corresponding to SubRepresentation@level and all level values from SubRepresentation@dependencyLevel may be extracted via byte ranges constructed from the information in the Subsegment Index box. Those ranges can be used to construct more compact HTTP GET requests.
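To make the quoted text concrete, here is a sketch of what such an MPD could look like and how a client might pick trick-mode tracks from it (Python). The EssentialProperty scheme URI is the one used in the DASH-IF guidelines; the rest of the fragment is illustrative, not normative:

```python
# Sketch: a minimal trick-mode MPD fragment and a query for Representations
# usable for fast-forward (and, if codingDependency="false", fast-rewind).
import xml.etree.ElementTree as ET

MPD = """\
<MPD xmlns="urn:mpeg:dash:schema:mpd:2011">
  <Period>
    <AdaptationSet id="1" contentType="video">
      <Representation id="main" frameRate="25" bandwidth="2000000"/>
    </AdaptationSet>
    <!-- Clients that don't understand the EssentialProperty below
         must ignore this whole AdaptationSet. -->
    <AdaptationSet id="2" contentType="video">
      <EssentialProperty schemeIdUri="http://dashif.org/guidelines/trickmode"
                         value="1"/>
      <Representation id="ff2x" frameRate="12.5" maxPlayoutRate="2"
                      codingDependency="false" bandwidth="500000"/>
    </AdaptationSet>
  </Period>
</MPD>"""

NS = "{urn:mpeg:dash:schema:mpd:2011}"
for rep in ET.fromstring(MPD).iter(NS + "Representation"):
    rate = rep.get("maxPlayoutRate")
    if rate is not None:
        rewind_ok = rep.get("codingDependency") == "false"
        print(f"{rep.get('id')}: up to {rate}x forward, rewind possible: {rewind_ok}")
```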

Comparing audio recordings

I have 5 recorded WAV files. I want to compare new incoming recordings with these files and determine which one each recording resembles most.
In the final product I need to implement this in C++ on Linux, but for now I am experimenting in MATLAB. I can view FFT plots very easily, but I don't know how to compare them.
How can I compute the similarity of two FFT plots?
Edit: There is only speech in the recordings. Actually, I am trying to identify the responses of a few telecom companies' answering machines. It's enough to distinguish two messages such as "this person can not be reached at the moment" and "this number is not used anymore".
This depends a lot on your definition of "resembles most"; depending on your use case it can mean many things. If you just want to compare the bare spectra of the whole files, you can simply correlate the values returned by the two FFTs.
However, spectra tend to change a lot when the files get warped in time. To deal with this, you need to do a windowed FFT and compare the spectra window by window. That comparison then defines the difference function you can use in a dynamic time warping algorithm.
If you need perceptual resemblance, an FFT probably does not get you what you need. MFCCs of the recordings are most likely much closer to this problem. Again, you might need to calculate windowed MFCCs instead of MFCCs of the whole recording.
If you have musical recordings, you need completely different approaches. There is a blog posting that describes how Shazam works, so you might be able to find it on Google. Or, if you want real musical similarity, have a look at this book.
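A sketch of the windowed-FFT-plus-DTW idea from above (Python with NumPy/SciPy; window and hop sizes are arbitrary choices of mine):

```python
# Sketch: compare recordings by windowed FFT magnitudes plus a small
# dynamic time warping distance. O(n*m) DTW; fine for short recordings.
import numpy as np
from scipy.io import wavfile

def log_spectrogram(path, n_fft=512, hop=256):
    rate, x = wavfile.read(path)
    if x.ndim > 1:
        x = x.mean(axis=1)                      # mix down to mono
    x = x.astype(np.float64)
    frames = [x[i:i + n_fft] * np.hanning(n_fft)
              for i in range(0, len(x) - n_fft, hop)]
    return np.log1p(np.abs(np.fft.rfft(frames, axis=1)))

def dtw_distance(a, b):
    n, m = len(a), len(b)
    d = np.full((n + 1, m + 1), np.inf)
    d[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            d[i, j] = cost + min(d[i - 1, j], d[i, j - 1], d[i - 1, j - 1])
    return d[n, m]

# incoming = log_spectrogram("incoming.wav")     # hypothetical file names
# best = min(templates, key=lambda p: dtw_distance(incoming, log_spectrogram(p)))
```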
EDIT:
The best solution for the problem specified above would be the one described here (the "Shazam algorithm" mentioned above). It is, however, a bit complicated to implement, and an easier solution might do well enough.
If you know that there are only 5 possible incoming files, I would suggest first trying something as simple as the Euclidean distance between the two signals (in the time or Fourier domain). It is likely to give you good results.
Edit: To cope with different possible start offsets, try a cross-correlation and see which file has the highest peak, as sketched below.
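That could look like the following (Python with SciPy; the DC removal and normalization are my additions so louder files don't automatically win):

```python
# Sketch: pick the template whose cross-correlation with the incoming
# recording has the highest peak, tolerating different start offsets.
import numpy as np
from scipy.io import wavfile
from scipy.signal import fftconvolve

def load_mono(path):
    rate, x = wavfile.read(path)
    if x.ndim > 1:
        x = x.mean(axis=1)
    x = x.astype(np.float64)
    return (x - x.mean()) / (np.abs(x).max() + 1e-12)

def best_match(incoming_path, template_paths):
    incoming = load_mono(incoming_path)
    def peak(path):
        template = load_mono(path)
        # Cross-correlation = convolution with the time-reversed template.
        return np.max(np.abs(fftconvolve(incoming, template[::-1], mode="full")))
    return max(template_paths, key=peak)
```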
I suggest you compute a simple sound parameter like the fundamental frequency. There are several methods of getting this value; I tried autocorrelation and the cepstrum, and for voice signals they worked fine. With that in place, you can run a windowed analysis over time and compare the two signals (the base signal you compare against, and the incoming signal you want to match) interval by interval. Comparing several intervals by this criterion can tell you which base sample matches best.
Of course, everything depends on what you mean by "resembles most". For the comparison you can introduce other parameters, like volume, noise, clicks, or pitch.
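A minimal autocorrelation-based estimate along these lines (Python with NumPy; the 50-400 Hz search range is an assumption for speech, and each frame should span a few pitch periods, e.g. 30-50 ms):

```python
# Sketch: crude fundamental-frequency estimate for one frame of speech.
import numpy as np

def fundamental_frequency(frame, rate, f_min=50.0, f_max=400.0):
    frame = frame - frame.mean()
    # Keep only non-negative lags of the autocorrelation.
    corr = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(rate / f_max), int(rate / f_min)   # lag bounds for the range
    lag = lo + int(np.argmax(corr[lo:hi]))
    return rate / lag
```

Tracking this value frame by frame gives a pitch contour you can compare between the base and incoming recordings.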

Audio normalization/fixation?

I am using an audio fingerprinting technique to mark songs in long recordings, for example recordings of radio shows. The fingerprinting mechanism works fine, but I have a problem with normalization (or downsampling).
Here you can see the same song twice, with different waveforms. I know I should fix the DC offset and apply some high-pass and low-pass filters. I already do this with SoX, using highpass 1015 and lowpass 1015, and I use WaveGain to fix the volume and DC offset. But in that case the waveforms turn into one like below:
But even in this case, I can't get the same fingerprint. (I am not expecting a 100% match, but at least 50% would be good.)
So, what do you think? What can I do to fix the recordings so they produce the same fingerprints? Maybe some audio filtering would work, but I don't know which filter to use. Can you help me?
By the way, here is the explanation of the fingerprinting technique:
http://wiki.musicbrainz.org/Future_Proof_Fingerprint
http://wiki.musicbrainz.org/Future_Proof_Fingerprint_Function
Your input waveforms appear to be clipping, so no amount of filtering is going to result in a meaningful "fingerprint". Make sure you collect valid input samples that have a reasonable dynamic range but which do not clip.
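A quick sanity check along those lines (a sketch assuming 16-bit PCM WAV input and SciPy; the threshold and cutoff are arbitrary):

```python
# Sketch: estimate what fraction of samples sit at (or next to) the int16
# rails, a telltale sign of clipping.
import numpy as np
from scipy.io import wavfile

def clipping_ratio(path, threshold=32766):
    rate, x = wavfile.read(path)   # int16 samples for 16-bit PCM
    # Cast up before abs() so -32768 doesn't overflow int16.
    return float((np.abs(x.astype(np.int32)) >= threshold).mean())

# if clipping_ratio("radio_show.wav") > 0.001:   # "radio_show.wav" is hypothetical
#     print("input is clipped; re-capture at lower gain before fingerprinting")
```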
