Is it possible to stitch MP3 frames together? - audio

I'm working with CBR, no bit reservoir, 192k bitrate, and 48k sample rate MP3 files.
CBR + 192k bitrate + 48k sample rate gives a clean 576 bytes per frame.
The bit reservoir is disabled so that each frame is independent.
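For reference, the 576-byte figure follows directly from the MPEG-1 Layer III frame-size formula, 144 * bitrate / sample_rate; a quick sanity check, plain arithmetic only:

```typescript
// MPEG-1 Layer III frame size in bytes (padding bit off): 144 * bitrate / sampleRate
const bitrate = 192_000;   // 192k CBR, in bits per second
const sampleRate = 48_000; // 48 kHz
const frameSize = (144 * bitrate) / sampleRate;
console.log(frameSize);    // 576 -> every frame occupies exactly 576 bytes
```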
The reason I want to stitch them is that I want to stream the MP3 (chunk by chunk).
Therefore I need to decode each chunk into PCM for playback.
When stitching the raw PCM data of the decoded MP3 together, I can hear a click/glitch/silence/something between each chunk on playback.
How can I stream MP3 perfectly without any click, considering my constraints (only CBR, no bit reservoir, etc)? Is it even possible?

I don't think you can cut and concatenate MP3 frames naively.
The Inverse Modified Discrete Cosine Transform (IMDCT), which is part of the decoding process, has different windowing modes.
The windowing mode is signaled within each MP3 frame.
In at least one windowing mode the IMDCT reuses values from the previous MP3 frame.
This means you need to decode the previous frame in order to decode the current frame correctly.
Let's assume you have packets from file a and file b and you would like to play:
a1 a2 a3 b6 b7 b8
To decode b6 correctly, you need to decode b5 first and then throw away the PCM samples of b5.
So at the cut you have to prime the decoder with b5 without playing b5.
a1 a2 a3 [b5] b6 b7 b8
You could send your player an additional packet at the cuts and then signal the player to discard the priming samples of the decoded additional packets.
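A minimal sketch of that priming step, assuming a hypothetical frame-level decoder interface (Mp3Decoder, decodeFrame and reset are placeholder names, not any particular library's API):

```typescript
// Hypothetical decoder interface: feed one MP3 frame, get back its PCM samples.
interface Mp3Decoder {
  decodeFrame(frame: Uint8Array): Int16Array;
  reset(): void; // clears the IMDCT overlap state
}

// Play a1 a2 a3 and then jump to b6 b7 b8: decode b5 first and discard its
// output so the overlap state is correct when b6 is decoded.
function decodeAcrossCut(
  decoder: Mp3Decoder,
  before: Uint8Array[],   // [a1, a2, a3]
  priming: Uint8Array,    // b5 - decoded but never played
  after: Uint8Array[]     // [b6, b7, b8]
): Int16Array[] {
  const pcm: Int16Array[] = [];
  for (const f of before) pcm.push(decoder.decodeFrame(f));

  decoder.reset();               // drop the overlap carried over from a3
  decoder.decodeFrame(priming);  // prime with b5, throw the samples away

  for (const f of after) pcm.push(decoder.decodeFrame(f));
  return pcm;
}
```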

Related

what is audio PCM's frame sync word to identify the beginning position

As the title says: for some compressed formats such as EAC3 and AC3, a frame starts with a sync word.
So what's PCM (raw audio)'s sync word? How to identify the beginning of a PCM frame?
I ran into a problem where the audio is a concatenation of several audio segments, each with a different frame size. I need to identify the start positions.
Thanks in advance.
There is no such concept as a frame in PCM. The purpose of a frame is to indicate points of random access. In PCM every single sample is a point of random access, hence start indicators are not required, and there is no standard frame size. It is all up to you.
A PCM frame is different from the frames you're describing, in that a frame is just a single sample on all channels. That is, if I'm recording 16-bit stereo PCM audio, each frame is 4 bytes (32 bits) long.
There is no sync word, nor frame header in raw PCM. It's just a stream of data. You need to know the bit depth, channel count, and current offset if you want to sync to it. (Or, you need to do some simple heuristics. For example, apply several different formats and offsets to a small chunk of data and see which one has the least variance/randomness from sample to sample.)
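A rough sketch of that heuristic, assuming 16-bit little-endian samples in a Node.js Buffer: try a few candidate byte offsets and keep the one whose consecutive samples differ the least, since a correctly aligned interpretation decodes to a smooth waveform while a misaligned one looks like noise. This is only illustrative, not a robust detector.

```typescript
// Score one interpretation: mean absolute difference between consecutive
// 16-bit little-endian samples starting at byteOffset.
function roughness(buf: Buffer, byteOffset: number, bytesPerSample = 2): number {
  let prev = buf.readInt16LE(byteOffset);
  let total = 0;
  let count = 0;
  for (let i = byteOffset + bytesPerSample; i + 1 < buf.length; i += bytesPerSample) {
    const cur = buf.readInt16LE(i);
    total += Math.abs(cur - prev);
    prev = cur;
    count++;
  }
  return count > 0 ? total / count : Number.POSITIVE_INFINITY;
}

// Try each candidate offset on a small chunk and pick the smoothest one.
function guessOffset(chunk: Buffer, candidates: number[] = [0, 1, 2, 3]): number {
  return candidates.reduce((best, off) =>
    roughness(chunk, off) < roughness(chunk, best) ? off : best
  );
}
```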

difference between audio data in .aac and .m4a files at the byte level

This is for trying to understand the AAC file structure.
For simplicity we can assume an m4a file with only AAC audio (that is, no video).
I am comparing an m4a file and an aac file made from the m4a file using the faac library.
A screenshot of the byte level comparison of the two files is given below:
the upper part is the m4a file and the lower part is the aac file
For the very first frame, from the lower part, the AAC ADTS header is FF F9 4C 80 12 3F FC, and from here and here the actual AAC audio data should be 138 bytes.
From the lower part we can see that the bytes from DE to 80 match a block of data in the upper window (the green coloured part)
I had assumed that I had found the offset in the m4a file from where the actual AAC audio data is stored. I thought that the bytes 21 4C ... in the upper window contained all the bytes of the next frame of AAC audio, and if we look at the lower window, we can indeed see that ... a3 80 (end of the green coloured part in the lower window) is followed by another ADTS header (FF F9 4C 80 12 1F FC, which says the amount of AAC audio data in the next frame should be 137 bytes).
However, the bytes read from the aac file do not match those in the m4a file as expected, as shown in the screenshot below:
They match up to a certain point but everything after that looks random.
What is the relationship between m4a file and the corresponding aac file created from it, at the byte level?
The main goal is to be able to modify this matlab project (which takes m4a input and decodes it to raw wav data) to read AAC audio data directly from the aac file, and then use the AAC decoder functions from that matlab project unchanged.
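For reference, the 138- and 137-byte figures quoted above come from the 13-bit frame-length field in the ADTS header, which counts the 7-byte header (no CRC here) plus the raw AAC payload; the m4a side normally stores those same raw AAC frames inside its mdat box, with per-frame sizes in the stsz box instead of ADTS headers. A minimal sketch of reading that field:

```typescript
// Extract the 13-bit aac_frame_length from a 7-byte ADTS header (protection_absent = 1).
// The value includes the header itself, so the raw AAC payload is frameLength - 7 bytes.
function adtsFrameLength(h: Uint8Array): number {
  if (h[0] !== 0xff || (h[1] & 0xf0) !== 0xf0) throw new Error("not an ADTS header");
  return ((h[3] & 0x03) << 11) | (h[4] << 3) | ((h[5] & 0xe0) >> 5);
}

// First header quoted above: FF F9 4C 80 12 3F FC
const header = Uint8Array.from([0xff, 0xf9, 0x4c, 0x80, 0x12, 0x3f, 0xfc]);
console.log(adtsFrameLength(header));     // 145 bytes in total
console.log(adtsFrameLength(header) - 7); // 138 bytes of raw AAC data
```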

How can I change the frequency of a wav file in node.js?

I have a short wav file. This file has a musical note of an instrument recorded, for example: a piano playing B3, i.e. 246.942 Hz (pb3.wav), or a violin playing F♯6, i.e. 1479.98 Hz (vfzs6.wav).
I want node.js to save a new wav file based on the file pb3.wav, but played at C4 (261.626 Hz) or D4 (293.665 Hz).
I can already get the wav file's frequency, and I have a list of the frequencies for each musical note.
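The simplest approach is to resample by the ratio of the target and source frequencies, which shifts the pitch and also changes the duration. A minimal sketch, assuming the decoded PCM of pb3.wav is already in a Float32Array (WAV header reading and writing omitted):

```typescript
// Shift pitch by resampling: reading the input ratio times faster scales every
// frequency by targetHz / sourceHz. Linear interpolation keeps the sketch short;
// a real resampler would low-pass filter as well.
function pitchShift(samples: Float32Array, sourceHz: number, targetHz: number): Float32Array {
  const ratio = targetHz / sourceHz;          // e.g. 261.626 / 246.942 for B3 -> C4
  const outLength = Math.floor(samples.length / ratio);
  const out = new Float32Array(outLength);
  for (let i = 0; i < outLength; i++) {
    const pos = i * ratio;
    const i0 = Math.floor(pos);
    const i1 = Math.min(i0 + 1, samples.length - 1);
    const frac = pos - i0;
    out[i] = samples[i0] * (1 - frac) + samples[i1] * frac;
  }
  return out;
}
```

If the note's duration has to stay the same, plain resampling is not enough; a time-stretching approach such as a phase vocoder would be needed instead.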

A way to add data "mid stream" to encoded audio (possibly with AAC)

Is there a way to add lossless data to an AAC audio stream?
Essentially I am looking to be able to inject "this frame of audio should be played at XXX time" every n frames in.
If I use a lossless codec, I suppose I could just inject my own header mid stream and that data would remain intact, since it has to come out the same on the other end, just as gzip does not lose data.
Any ideas? I suppose I could encode the data into chunks of AAC on the server and, at the network layer, add a timestamp saying "play the following chunk of AAC at time x", but I'd prefer to find a way to add it to the audio itself.
This is not really possible (short of writing your own specialized encoder), as AAC (and MP3) frames are not truly standalone.
There is a concept of the bit reservoir, where unused bandwidth from one frame can be utilized for a later frame that may need more bandwidth to store a more complicated sound. That is, data from frame 1 might be needed in frame 2 and/or 3. If you cut the stream between frames 1 and 2 and insert your alternative frames, the reference to the bit reservoir data is broken and you have damaged frame 2's ability to be decoded.
There are encoders that can work in a mode where the bit reservoir isn't used (at the cost of quality). If operating in this mode, you should be able to cut the stream more freely along frame boundaries.
Unfortunately, the best way to handle this is to do it in the time domain when dealing with your raw PCM samples. This gives you more control over the timing placement anyway, and ensures that your stream can also be used with other codecs.
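For completeness, a minimal sketch of the network-layer workaround the question already mentions: prefix each untouched AAC chunk with a timestamp and a length instead of modifying the bitstream. The 12-byte framing layout here is made up purely for illustration:

```typescript
// Made-up framing: [8-byte big-endian play time in ms][4-byte payload length][AAC chunk].
// The AAC data itself is untouched, so any decoder can still read it after unwrapping.
function wrapChunk(aacChunk: Uint8Array, playAtMs: number): Buffer {
  const header = Buffer.alloc(12);
  header.writeBigUInt64BE(BigInt(playAtMs), 0);
  header.writeUInt32BE(aacChunk.length, 8);
  return Buffer.concat([header, Buffer.from(aacChunk)]);
}

function unwrapChunk(packet: Buffer): { playAtMs: number; aacChunk: Buffer } {
  const playAtMs = Number(packet.readBigUInt64BE(0));
  const length = packet.readUInt32BE(8);
  return { playAtMs, aacChunk: packet.subarray(12, 12 + length) };
}
```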

About definition for terms of audio codec

When I was studying the Cocoa Audio Queue documentation, I came across several audio codec terms. They are defined in a structure named AudioStreamBasicDescription.
Here are the terms:
1. Sample rate
2. Packet
3. Frame
4. Channel
I know about sample rate and channel, but I am confused by the other two. What do the other two terms mean?
You can also answer this question by example. For example, I have a dual-channel 16-bit PCM source with a sample rate of 44.1 kHz, which means there are 2*44100 = 88200 samples of PCM data per second. But how about packet and frame?
Thank you in advance!
You are already familiar with the sample rate definition.
The sampling frequency or sampling rate, fs, is defined as the number of samples obtained in one second (samples per second), thus fs = 1/T.
So for a sampling rate of 44100 Hz, you have 44100 samples per second (per audio channel).
The number of frames per second in video is a similar concept to the number of samples per second in audio. Frames for our eyes, samples for our ears. Additional info here.
If you have 16-bit stereo PCM, that means 16*44100*2 = 1411200 bits per second => ~172 kB per second => around 10 MB per minute.
Now to the definitions, in reworded terms from Apple:
Sample: a single number representing the value of one audio channel at one point in time.
Frame: a group of one or more samples, with one sample for each channel, representing the audio on all channels at a single point in time.
Packet: a group of one or more frames, representing the audio format's smallest encoding unit, and the audio for all channels across a short amount of time.
As you can see, there is a subtle difference between the audio and video frame notions. In one second of stereo audio at 44.1 kHz you have 88200 samples and thus 44100 frames.
Compressed formats like MP3 and AAC pack multiple frames into packets (these packets can then be written into an MP4 file, for example, where they can be efficiently interleaved with video content). Dealing with larger packets helps the encoder identify bit patterns, for better coding efficiency.
MP3, for example, uses packets of 1152 frames, which are the basic atomic unit of an MP3 stream. PCM audio is just a series of samples, so it can be divided down to the individual frame, and it really has no packet size at all.
For AAC you can have 1024 (or 960) frames per packet. This is described in the Apple document you pointed at:
The number of frames in a packet of audio data. For uncompressed audio, the value is 1. For variable bit-rate formats, the value is a larger fixed number, such as 1024 for AAC. For formats with a variable number of frames per packet, such as Ogg Vorbis, set this field to 0.
In MPEG-based file formats a packet is referred to as a data frame (not to be confused with the audio frame notion above). See Brad's comment for more information on the subject.
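Putting the numbers from this answer in one place, as plain arithmetic for the 16-bit stereo 44.1 kHz example:

```typescript
const sampleRate = 44_100;     // frames per second (one frame = one sample per channel)
const channels = 2;
const bitsPerSample = 16;

const samplesPerSecond = sampleRate * channels;               // 88 200 samples/s
const bitsPerSecond = samplesPerSecond * bitsPerSample;       // 1 411 200 bits/s
const bytesPerMinute = (bitsPerSecond / 8) * 60;              // ~10.6 million bytes per minute

const mp3FramesPerPacket = 1152;  // an MP3 packet holds 1152 PCM frames
const aacFramesPerPacket = 1024;  // a typical AAC packet holds 1024 PCM frames
const mp3PacketsPerSecond = sampleRate / mp3FramesPerPacket;  // ~38.3 packets/s
const aacPacketsPerSecond = sampleRate / aacFramesPerPacket;  // ~43.1 packets/s

console.log({ samplesPerSecond, bitsPerSecond, bytesPerMinute,
              mp3PacketsPerSecond, aacPacketsPerSecond });
```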
