IMFTransform SetInputType()/SetOutputType() fails - audio

I'm trying to play back MP3 (and similar audio files) using WASAPI shared mode and a Media Foundation IMFSourceReader on Windows 7. From what I understand, I have to use an IMFTransform between the IMFSourceReader decoding and the WASAPI playback. Everything seems fine until I call SetInputType()/SetOutputType() on the IMFTransform.
The relevant snippets of code are:
MFCreateSourceReaderFromURL(...); // Various test mp3 files
...
sourceReader->GetCurrentMediaType(MF_SOURCE_READER_FIRST_AUDIO_STREAM, &reader.audioType);
//sourceReader->GetNativeMediaType(MF_SOURCE_READER_FIRST_AUDIO_STREAM, 0, &reader.audioType);
...
audioClient->GetMixFormat(&player.mixFormat);
...
MFCreateMediaType(&player.audioType);
MFInitMediaTypeFromWaveFormatEx(player.audioType, player.mixFormat, sizeof(WAVEFORMATEX) + player.mixFormat->cbSize);
...
hr = CoCreateInstance(CLSID_CResamplerMediaObject, NULL, CLSCTX_INPROC_SERVER, IID_IUnknown, (void**)&unknown);
ASSERT(SUCCEEDED(hr));
hr = unknown->QueryInterface(IID_PPV_ARGS(&resampler.transform));
ASSERT(SUCCEEDED(hr));
unknown->Release();
hr = resampler.transform->SetInputType(0, inType, 0);
ASSERT(hr != DMO_E_INVALIDSTREAMINDEX);
ASSERT(hr != DMO_E_TYPE_NOT_ACCEPTED);
ASSERT(SUCCEEDED(hr)); // Fails here with hr = 0xc00d36b4
hr = resampler.transform->SetOutputType(0, outType, 0);
ASSERT(hr != DMO_E_INVALIDSTREAMINDEX);
ASSERT(hr != DMO_E_TYPE_NOT_ACCEPTED);
ASSERT(SUCCEEDED(hr)); // Fails here with hr = 0xc00d6d60
I suspect I am misunderstanding how to negotiate the input/output IMFMediaTypes between the components, and also how to take into account that the IMFTransform needs to operate on uncompressed data.
It seems odd to me that the output type fails, but maybe that is a knock-on effect of the input type failing first - if I try to set the output type first, it fails as well.

In recent versions of Windows you would probably prefer to take advantage of stock functionality that is already there for you.
When you configure the Source Reader object, IMFSourceReader::SetCurrentMediaType lets you specify the media type you want your data in. If you set a media type compatible with the WASAPI requirements, the Source Reader automatically adds a transform to convert the data for you.
However...
Audio resampling support was added to the source reader with Windows 8. In versions of Windows prior to Windows 8, the source reader does not support audio resampling. If you need to resample the audio in versions of Windows earlier than Windows 8, you can use the Audio Resampler DSP.
... which means that indeed you might need to manage the MFT yourself. The input media type for the MFT is supposed to come from IMFSourceReader::GetCurrentMediaType. To instruct the source reader to give you uncompressed audio, you need to build the media type that a decoder for this kind of stream would decode the audio to. For example, if your file is MP3, you would read the number of channels and the sampling rate and build a compatible PCM media type (or instantiate the system decoder and ask it separately for its output media type, which is an even cleaner way). You would set this uncompressed audio media type using IMFSourceReader::SetCurrentMediaType. The same media type is also your input media type for the audio resampler MFT. Doing so instructs the source reader to add the necessary decoders, and IMFSourceReader::ReadSample gives you the converted data.
The output media type for the resampler MFT is derived from the audio format you obtained from WASAPI and converted using the API calls you mentioned at the top of your code snippet.
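To make that concrete, here is a minimal sketch in the spirit of the question's code (variable names such as sourceReader, reader.audioType, resampler.transform and player.audioType are taken from the snippets above; error handling is omitted, and the exact attribute set is an assumption - the reader also accepts less complete types and fills in the rest):
UINT32 channels = 0, sampleRate = 0;
reader.audioType->GetUINT32(MF_MT_AUDIO_NUM_CHANNELS, &channels);
reader.audioType->GetUINT32(MF_MT_AUDIO_SAMPLES_PER_SECOND, &sampleRate);

// Build a PCM media type the decoder can produce and hand it to the Source Reader;
// the reader then inserts the MP3 decoder internally and ReadSample yields PCM.
IMFMediaType* pcmType = NULL;
MFCreateMediaType(&pcmType);
pcmType->SetGUID(MF_MT_MAJOR_TYPE, MFMediaType_Audio);
pcmType->SetGUID(MF_MT_SUBTYPE, MFAudioFormat_PCM);
pcmType->SetUINT32(MF_MT_AUDIO_NUM_CHANNELS, channels);
pcmType->SetUINT32(MF_MT_AUDIO_SAMPLES_PER_SECOND, sampleRate);
pcmType->SetUINT32(MF_MT_AUDIO_BITS_PER_SAMPLE, 16);
sourceReader->SetCurrentMediaType(MF_SOURCE_READER_FIRST_AUDIO_STREAM, NULL, pcmType);

// Re-query the fully populated type and use it as the resampler's input;
// the WASAPI-derived type is the resampler's output.
IMFMediaType* resamplerIn = NULL;
sourceReader->GetCurrentMediaType(MF_SOURCE_READER_FIRST_AUDIO_STREAM, &resamplerIn);
hr = resampler.transform->SetInputType(0, resamplerIn, 0);
hr = resampler.transform->SetOutputType(0, player.audioType, 0);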
To look up the error codes you can use these:
https://www.magnumdb.com/search?q=0xc00d36b4
https://www.magnumdb.com/search?q=0xc00d6d60
Also, you should generally be able to play audio files with less effort using the Media Foundation Media Session API. The Media Session uses the same primitives to build a playback pipeline and takes care of format fitting.
Ah, so are you saying I need to create an additional object - a decoder - that sits between the IMFSourceReader and the IMFTransform/resampler?
No. By calling SetCurrentMediaType with the proper media type, you have the Source Reader add the decoder internally so that it can give you already decompressed data. Starting with Windows 8 it is also capable of converting between PCM flavors, but on Windows 7 you need to take care of that yourself with the Audio Resampler DSP.
You can manage the decoder yourself, but you don't need to, since the Source Reader's internally managed decoder does the same more reliably.
You might need a separate decoder instance just to help you work out what PCM media type the decoder would produce, so that you can request it from the Source Reader. MFTEnumEx is the proper API to look a decoder up.
I am not sure how to decide on or create a suitable decoder object. Do I need to enumerate a list of suitable ones somehow, rather than assume specific ones?
The mentioned MFTEnum and MFTEnumEx API calls can enumerate decoders, either all available ones or filtered by given criteria - see the sketch below.
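A sketch of that lookup (assumptions: unchecked HRESULTs, and the flag combination is just one reasonable choice):
// Enumerate audio decoders that accept MP3 input and activate the first match.
MFT_REGISTER_TYPE_INFO inInfo = { MFMediaType_Audio, MFAudioFormat_MP3 };

IMFActivate** activates = NULL;
UINT32 count = 0;
hr = MFTEnumEx(MFT_CATEGORY_AUDIO_DECODER,
               MFT_ENUM_FLAG_SYNCMFT | MFT_ENUM_FLAG_SORTANDFILTER,
               &inInfo,   // we need a decoder that accepts MP3 input
               NULL,      // any output type
               &activates, &count);

if (SUCCEEDED(hr) && count > 0)
{
    IMFTransform* decoder = NULL;
    activates[0]->ActivateObject(IID_PPV_ARGS(&decoder));
    // Set the compressed (native) type as its input first, then
    // GetOutputAvailableType(0, 0, ...) reports the PCM type it would produce.
    decoder->Release();
}
for (UINT32 i = 0; i < count; i++) activates[i]->Release();
CoTaskMemFree(activates);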
Another way is to use a partial media type (see the relevant explanation and code snippet here: Tutorial: Decoding Audio). A partial media type is a signal about the desired format, requesting that the Media Foundation API supply a primitive that matches this partial type. See comments below for related discussion links.
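For illustration, a partial media type in this context carries only the major type and subtype (a sketch, HRESULT checks omitted):
// Only major type and subtype are specified; the Source Reader picks a decoder
// and fills in the remaining attributes itself.
IMFMediaType* partialType = NULL;
MFCreateMediaType(&partialType);
partialType->SetGUID(MF_MT_MAJOR_TYPE, MFMediaType_Audio);
partialType->SetGUID(MF_MT_SUBTYPE, MFAudioFormat_PCM);
sourceReader->SetCurrentMediaType(MF_SOURCE_READER_FIRST_AUDIO_STREAM, NULL, partialType);
// GetCurrentMediaType afterwards returns the complete, negotiated PCM type.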

Related

About DirectShow source filter

I have created (C++, Win10, VS2022) a simple DirectShow source filter. It gets an audio stream from an external source (a file for testing, the network in future) and produces an audio stream on its output pin, which I connect to the speaker.
To do this I have implemented the FillBuffer method for the output pin (CSourceStream) of the filter. The media type is MEDIATYPE_Stream/MEDIASUBTYPE_PCM.
Before being connected, the pin gets info about the media type via SetMediaType (WAVEFORMATEX) and remembers the audio parameters - wBitsPerSample, nSamplesPerSec, nChannels. The audio stream comes from the external source (file or network) into FillBuffer with those parameters. It works fine.
But I need to handle the situation when the external source sends an audio stream to the filter with different parameters (for example, the old stream had 11025 Hz and the current one has 22050 Hz).
Could you help me - which actions and calls should I make in the FillBuffer() method if I receive an audio stream with a changed wBitsPerSample, nSamplesPerSec or nChannels parameter?
The fact is that these parameters have already been agreed between my output pin and the input pin of the speaker, and I need to change this agreement correctly.
You need to improve the implementation and handle
Dynamic Format Changes
...
QueryAccept (Downstream) is used when an output pin proposes a format change to its downstream peer, but only if the new format does not require a larger buffer.
This might not be trivial, because baseline DirectShow filters are not required to support dynamic changes. That is, the ability to change format depends on your actual pipeline and on the implementation of the other filters.
You should also be able to find the SDK helpers CDynamicSourceStream and CDynamicSource.
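As a rough illustration only (assuming the usual MEDIATYPE_Audio/MEDIASUBTYPE_PCM/FORMAT_WaveFormatEx pairing, that the downstream pin accepts the change, and that the new format fits the already negotiated buffer size), the QueryAccept route from a source pin could look like this; CMyAudioPin and ProposeNewFormat are hypothetical names:
HRESULT CMyAudioPin::ProposeNewFormat(const WAVEFORMATEX& newWfx, IMediaSample* pSample)
{
    // Describe the new PCM format
    CMediaType mt;
    mt.SetType(&MEDIATYPE_Audio);
    mt.SetSubtype(&MEDIASUBTYPE_PCM);
    mt.SetFormatType(&FORMAT_WaveFormatEx);
    mt.SetFormat((BYTE*)&newWfx, sizeof(WAVEFORMATEX));
    mt.SetSampleSize(newWfx.nBlockAlign);
    mt.SetTemporalCompression(FALSE);

    // Ask the connected downstream input pin whether it can take the new format
    if (GetConnected()->QueryAccept(&mt) != S_OK)
        return VFW_E_TYPE_NOT_ACCEPTED;

    // Attach the new type to the next outgoing sample; downstream sees it via
    // IMediaSample::GetMediaType and switches starting with that sample.
    pSample->SetMediaType(&mt);
    SetMediaType(&mt);   // remember the new format on our own pin as well
    return S_OK;
}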

Why can I sometimes concatenate audio data using NodeJS Buffers, and sometimes I cannot?

As part of a project I am working on, there is a requirement to concatenate multiple pieces of audio data into one large audio file. The audio files are generated from four sources, and the individual files are stored in a Google Cloud storage bucket. Each file is an mp3 file, and it is easy to verify that each individual file is generating correctly (individually, I can play them, edit them in my favourite software, etc.).
To merge the audio files together, a nodejs server loads the files from the Google Cloud storage as an array buffer using an axios POST request. From there, it puts each array buffer into a node Buffer using Buffer.from(), so now we have an array of Buffer objects. Then it uses Buffer.concat() to concatenate the Buffer objects into one big Buffer, which we then convert to Base64 data and send to the client server.
This is cool, but the issue arises when concatenating audio generated from different sources. The 4 sources I mentioned above are text-to-speech software platforms, such as Google Cloud Voice and Amazon Polly. Specifically, we have files from Google Cloud Voice, Amazon Polly, IBM Watson, and Microsoft Azure Text to Speech - essentially just four text-to-speech solutions. Again, all individual files work, but when concatenating them together via this method there are some interesting effects.
When the sound files are concatenated, seemingly depending on which platform they originate from, the sound data either will or will not be included in the final sound file. Below is a 'compatibility' table based on my testing:
|------------|--------|--------|-----------|-----|
| Platform   | Google | Amazon | Microsoft | IBM |
|------------|--------|--------|-----------|-----|
| Google     | Yes    | No     | No        | No  |
| Amazon     |        | No     | No        | Yes |
| Microsoft  |        |        | Yes       | No  |
| IBM        |        |        |           | Yes |
|------------|--------|--------|-----------|-----|
The effect is as follows: When I play the large output file, it will always start playing the first sound file included. From there, if the next sound file is compatible, it is heard, otherwise it is skipped entirely (no empty sound or anything). If it was skipped, the 'length' of that file (for example 10s long audio file) is included at the end of the generated output sound file. However, the moment that my audio player hits the point where the last 'compatible' audio has played, it immediately skips to the end.
As a scenario:
Input:
sound1.mp3 (3s) -> Google
sound2.mp3 (5s) -> Amazon
sound3.mp3 (7s) -> Google
sound4.mp3 (11s) -> IBM
Output:
output.mp3 (26s) -> first 10s is sound1 and sound3, last 16s is skipped.
In this case, the output sound file would be 26s seconds long. For the first 10 seconds, you would hear the sound1.mp3 and sound3.mp3 played back to back. Then at 10s (at least playing this mp3 file in firefox) the player immediately skips to the end at 26s.
My question is: Does anyone have any ideas why sometimes I can concatenate audio data in this way, and other times I cannot? And how come there is this 'missing' data included at the end of the output file? Shouldn't concatenating the binary data work in all cases if it works for some cases, as all the files have mp3 encoding? If I am wrong please let me know what I can do to successfully concatenate any mp3 files :)
I can provide my nodeJS backend code, but the process and methods used are described above.
Thanks for reading!
Potential Sources of Problems
Sample Rate
44.1 kHz is often used for music, as it's what is used on CD audio. 48 kHz is usually used for video, as it's what was used on DVDs. Both of those sample rates are much higher than is required for speech, so it's likely that your various text-to-speech providers are outputting something different. 22.05 kHz (half of 44.1 kHz) is common, and 11.025 kHz is out there too.
While each frame specifies its own sample rate, making it possible to generate a stream with varying sample rates, I've never seen a decoder attempt to switch sample rates mid-stream. I suspect that the decoder is skipping these frames, or maybe even skipping over an arbitrary block until it gets consistent data again.
Use something like FFmpeg (or FFprobe) to figure out what the sample rates of your files are:
ffmpeg -i sound2.mp3
You'll get an output like this:
Duration: 00:13:50.22, start: 0.011995, bitrate: 192 kb/s
Stream #0:0: Audio: mp3, 44100 Hz, stereo, fltp, 192 kb/s
In this example, 44.1 kHz is the sample rate.
Channel Count
I'd expect your voice MP3s to be in mono, but it wouldn't hurt to check to be sure. As with above, check the output of FFmpeg. In my example above, it says stereo.
As with sample rate, technically each frame could specify its own channel count but I don't know of any player that will pull off switching channel count mid-stream. Therefore, if you're concatenating, you need to make sure all the channel counts are the same.
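If you want to sanity-check sample rate and channel mode yourself before concatenating, the fields live in each frame's 4-byte header. A minimal sketch (C++ here purely for illustration; it assumes the buffer starts at a frame sync, i.e. any leading ID3v2 tag has already been skipped, and the rate table covers MPEG-1 Layer III only):
#include <cstdint>
#include <cstddef>

struct Mp3FrameInfo {
    int  sampleRate;   // Hz, 0 if the header is invalid/reserved
    int  channels;     // 1 = mono, 2 = stereo/joint stereo/dual channel
    bool valid;
};

Mp3FrameInfo ParseFirstFrameHeader(const uint8_t* data, size_t size)
{
    Mp3FrameInfo info = { 0, 0, false };
    if (size < 4 || data[0] != 0xFF || (data[1] & 0xE0) != 0xE0)
        return info;                                 // no frame sync at the start

    static const int mpeg1Rates[4] = { 44100, 48000, 32000, 0 };
    int rateIndex   = (data[2] >> 2) & 0x03;         // sampling-rate index
    int channelMode = (data[3] >> 6) & 0x03;         // 3 = single channel (mono)

    info.sampleRate = mpeg1Rates[rateIndex];
    info.channels   = (channelMode == 3) ? 1 : 2;
    info.valid      = (info.sampleRate != 0);
    return info;
}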
ID3 Tags
It's common for there to be ID3 metadata at the beginning (ID3v2) and/or end (ID3v1) of the file. It's less expected to have this data mid-stream. You would want to make sure this metadata is all stripped out before concatenating.
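For the stripping step, the ID3v2 header encodes its own length, so a small helper along these lines (a sketch; it ignores the optional footer and the trailing 128-byte ID3v1 block) can tell you how many leading bytes to drop:
#include <cstdint>
#include <cstddef>

// Returns the number of bytes occupied by a leading ID3v2 tag, or 0 if none.
// The size field is four "syncsafe" bytes (7 bits each) after the 10-byte header start.
size_t Id3v2TagSize(const uint8_t* data, size_t size)
{
    if (size < 10 || data[0] != 'I' || data[1] != 'D' || data[2] != '3')
        return 0;
    size_t tagSize = ((size_t)(data[6] & 0x7F) << 21) |
                     ((size_t)(data[7] & 0x7F) << 14) |
                     ((size_t)(data[8] & 0x7F) << 7)  |
                      (size_t)(data[9] & 0x7F);
    return 10 + tagSize;   // header + tag body
}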
MP3 Bit Reservoir
MP3 frames don't necessarily stand alone. If you have a constant bitrate stream, the encoder may still use less data to encode one frame, and more data to encode another. When this happens, some frames contain data for other frames. That way, frames that could benefit from the extra bandwidth can get it while still fitting the whole stream within a constant bitrate. This is the "bit reservoir".
If you cut a stream and splice in another stream, you may split up a frame and its dependent frames. This typically causes an audio glitch, but may also cause the decoder to skip ahead. Some badly behaving decoders will just stop playing altogether. In your example, you're not cutting anything so this probably isn't the source of your trouble... but I mention it here because it's definitely relevant to the way you're working these streams.
See also: http://wiki.hydrogenaud.io/index.php?title=Bit_reservoir
Solutions
Pick a "normal" format, resample and re-encode non-conforming files
If most of your sources are in the exact same format and only one or two stand out, you could convert the non-conforming files. From there, strip the ID3 tags from everything and concatenate away.
To do the conversion, I'd recommend kicking it over to FFmpeg as a child process.
const child_process = require('child_process');

child_process.spawn('ffmpeg', [
  // Input
  '-i', inputFile, // Use '-' to read from STDIN instead
  // Set sample rate
  '-ar', '44100',
  // Set audio channel count
  '-ac', '1',
  // Audio bitrate... try to match others, but not as critical
  '-b:a', '64k',
  // Ensure we output an MP3
  '-f', 'mp3',
  // Output
  outputFile // As with input, use '-' to write to STDOUT
]);
Best Solution: Let FFmpeg (or similar) do the work for you
The simplest, most robust solution to all of this is to let FFmpeg build a brand new stream for you. This will cause your audio files to be decoded to PCM, and a new stream made. You can add parameters to resample those inputs, and modify channel counts if needed. Then output one stream. Use the concat filter.
This way, you can accept audio files of any type, you don't have to write the code to hack those streams together, and once set up you won't have to worry about it.
The only downside is that it will require a re-encoding of everything, meaning another generation of quality lost. This would be required for any non-conforming files anyway, and it's just speech, so I wouldn't give it a second thought.
@Brad's answer was the solution! The first solution he suggested worked. It took some messing around getting FFmpeg to work correctly, but in the end using the fluent-ffmpeg library worked.
Each file in my case was stored on Google Cloud Storage, not on the server's hard drive. This posed some problems for FFmpeg, as it requires file paths when given multiple inputs, or a single input stream (only one is supported, as there is only one STDIN).
One solution is to put the files on the hard drive temporarily, but this would not work for our use case, as this function may see a lot of use and the hard drive adds latency.
So, instead we did as suggested and loaded each file into ffmpeg to convert it into a standardized format. This was a bit tricky, but in the end requesting each file as a stream, using that stream as an input for ffmpeg, then using fluent-ffmpeg's pipe() method (which returns a stream) as output worked.
We then bound an event listener to the 'data' event for this pipe, and pushed the data to an array (bufs.push(data)), and on stream 'end' we concatenated this array using Buffer.concat(bufs), followed by a promise resolve.
Then once all request promises were resolved, we could be sure ffmpeg had processed each file, and those buffers were concatenated in the required groups as before using Buffer.concat(), converted to Base64 data, and sent to the client.
This works great, and now it seems to be able to handle every combination of files/sources I can throw at it!
In conclusion:
The answer to the question is that the mp3 data must have been encoded differently (different channels, sample rates, etc.), and loading it through ffmpeg and outputting it in a 'unified' way made the mp3 data compatible.
The solution was to process each file in ffmpeg separately, pipe the ffmpeg output into a buffer, then concatenate the buffers.
Thanks @Brad for your suggestions and detailed answer!

mpeg-dash and codecs specification

Looking at the article: http://www.streamingmedia.com/Articles/Editorial/What-Is-.../What-is-MPEG-DASH-79041.aspx
It makes statements like: DASH is codec-independent, and will work with H.264, WebM and other codecs
DASH supports both the ISO Base Media File Format (essentially the MP4 format) and MPEG-2 transport streams
DASH does not specify a DRM method but supports all DRM techniques specified in ISO/IEC 23001-7: Common Encryption
But how is the audio/video compression, or the DRM method, specified in the Media Presentation? Where can I find more details?
DASH is a streaming protocol - the video stream is inside a 'container' and the container is broken into chunks and streamed. A very high level view of the video component is:
elementary video stream encoded with some codec
fragmented mp4 container (broken into chunks to facilitate ABR)
MPEG DASH streaming protocol
The mp4 container header information contains information about all the streams it contains - this will include the codec that it used to encode the stream (e.g. h.264 for a video stream).
ABR essentially allows the client device or player to download the video in chunks, e.g. 10-second chunks, and to select the next chunk from the bit rate most appropriate to the current network conditions.
The DASH manifest (essentially an index file that contains pointers to the different bit rate streams etc) contains header information about the protections systems in use, for example Widevine or PlayReady DRMs.
The mp4 container also contains information about the protection system in a special PSSH (Protection System Specific Headers) header for the protection systems in use, for example again, Widevine or PlayReady.
Generally DASH streams will have the protection information in both places to ensure that all players can play the stream, but last time I looked, I think the spec strictly speaking says it can be in either or both.
The specs themselves are available here:
http://standards.iso.org/ittf/PubliclyAvailableStandards/index.html (search for DASH)
https://www.iso.org/standard/68042.html - unfortunately, this one requires payment AFAIK. You can see a W3C spec which uses it here, however: https://w3c.github.io/encrypted-media/format-registry/stream/mp4.html
And there is a nice overview of DASH here:
https://www.w3.org/2011/09/webtv/slides/W3C-Workshop.pdf
And, of course, the classic reference to some of the drivers for DASH and similar standards:
https://xkcd.com/927/

J2ME - Fm recording app - can't buffer and write to a file

Hey everyone, I am developing a J2ME app that records FM radio. I have tried many methods but I have failed. The major problem I face is that, in the Media API for J2ME, once the code for tuning into a specific FM channel is written (it works, but only outputs directly to the speaker), I couldn't find a way to buffer the output and write it into a file. Thanks in advance.
I think it is not possible with MMAPI directly. I assume the FM radio streams via RTSP, and you can specify that as the data source for MMAPI, but if you want to store the audio data, you need to fetch it into your own buffer instead and then pass it to the MMAPI Player via an InputStream.
That way you will need to code your own handling for RTSP (or whatever your FM radio uses) and convert the data into a format acceptable to the MMAPI Player via InputStream, for example audio/x-wav or audio/amr. If the format's header doesn't need to specify the length of the data, then you can probably 'stream' it via your buffer while receiving data from the RTSP source.
This is fairly low-level coding; I think it will be hard to implement in J2ME.

Adding audio effects (reverb etc..) to a BackgroundAudioPlayer driven streaming audio app

I have a windows phone 8 app which plays audio streams from a remote location or local files using the BackgroundAudioPlayer. I now want to be able to add audio effects, for example, reverb or echo, etc...
Please could you advise me on how to do this? I haven't been able to find a way of hooking extra audio processing code into the audio processing pipeline, even though I've read a lot about WASAPI and XAudio2 and looked at many code examples.
Note that the app is written in C# but, from my previous experience with writing audio processing code, I know that I should be writing the audio code in native C++. Roughly speaking, I need to find a point at which there is an audio buffer containing raw PCM data which I can use as an input for my audio processing code which will then write either back to the same buffer or to another buffer which is read by the next stage of audio processing. There need to be ways of synchronizing what happens in my code with the rest of the phone's audio processing mechanisms and, of course, the process needs to be very fast so as not to cause audio glitches. Or something like that; I'm used to how VST works, not how such things might work in the Windows Phone world.
Looking forward to seeing what you suggest...
Kind regards,
Matt Daley
I need to find a point at which there is an audio buffer containing raw PCM data
AFAIK there's no such point. This MSDN page hints that audio/video decoding is performed not by the OS, but by the Qualcomm chip itself.
You can use something like Mp3Sharp for decoding. This way the MP3 will be decoded on the CPU by your managed code, you can intercept and process the samples however you like, then feed the PCM into the media stream source. Main downside - battery life: the hardware-provided codecs should be much more power-efficient.
