Debug NAudio MP3 reading difference? - audio

My code using NAudio to read one particular MP3 gets different results than several other commercial apps.
Specifically: My NAudio-based code finds ~1.4 sec of silence at the beginning of this MP3 before "audible audio" (a drum pickup) starts, whereas other apps (Windows Media Player, RealPlayer, WavePad) show ~2.5 sec of silence before that same drum pickup.
The particular MP3 is "Like A Rolling Stone" downloaded from Amazon.com. Tested several other MP3s and none show any similar difference between my code and other apps. Most MP3s don't start with such a long silence so I suspect that's the source of the difference.
Debugging problems:
I can't actually find a way to even prove that the other apps are right and NAudio/me is wrong, i.e. to compare block-by-block my code's results to a "known good reference implementation"; therefore I can't even precisely define the "error" I need to debug.
Since my code reads thousands of samples during those 1.4 sec with no obvious errors, I can't think how to narrow down where/when in the input stream to look for a bug.
The heart of the NAudio code is a P/Invoke call to acmStreamConvert(), which is a Windows "black box" call which I can't think how to error-check.
Can anyone think of any tricks/techniques to debug this?

The NAudio ACM code was never originally intended for MP3s, but for decoding constant bit rate telephony codecs. One day I tried setting up the WaveFormat to specify MP3 as an experiment, and what came out sounded good enough. However, I have always felt a bit nervous about decoding MP3s (especially VBR) with ACM (e.g. what comes out if ID3 tags or album art get passed in - could that account for extra silence?), and I've never been 100% convinced that NAudio does it right - there is very little documentation on how exactly you are supposed to use the ACM codecs. Sadly there is no managed MP3 decoder with a license I can use in NAudio, so ACM remains the only option for the time being.
I'm not sure what approach other media players take to playing back MP3, but I suspect many of them have their own built-in MP3 decoders, rather than relying on the operating system.

I've found some partial answers to my own Q:
Since my problem boils down to consuming too much MP3 w/o producing enough PCM, I used conditional-on-hit-count breakpoints to find just where this was happening, then drilled into that.
This showed me that some acmStreamConvert() calls are returning success, consuming 417 src bytes, but producing 0 "dest bytes used".
Next I plan to try acmStreamSize() to ask the codec how many src bytes it "wants" to consume, rather than "telling" it to consume 417.
Edit (followup): I fixed it!
It came down to passing acmStreamConvert() enough src bytes to make it happy. Giving it its acmStreamSize() requested size fixed the problem in some places but then it popped up in others; giving it its requested size times 3 seems to cure the "0 dest bytes used" result in all MP3s I've tested.
With this fix, acmStreamConvert() then sometimes returned much larger converted chunks (almost 32 KB), so I also had to modify some other NAudio code to pass in larger destination buffers to hold the results.

Related

How can I detect corrupt/incomplete MP3 file, from a node.js app?

The common situation when the integrity of an MP3 file is not correct, is when the file has been partially uploaded to the server. In this case, the indicated audio duration doesn't correspond to what is really in the MP3 file: we can hear the beginning, but at some point the playing stops and the indicated duration of the audio player is broken.
I tried with libraries like node-ffprobe, but it seems they just read metadata, without making comparison with real audio data in the file. Is there a way to detect efficiently a corrupted or incomplete MP3 file from node.js?
Note: the client uploading MP3 files is a hardware (an audio recorder), uploading files on a FTP server. Not a browser. So I'm not able to upload potentially more useful data from the client.
MP3 files don't normally have a duration. They're just a series of MPEG frames. Sometimes, there is an ID3 tag indicating duration, but not always.
Players can determine duration by choosing one of a few methods:
Decode the entire audio file.This is the slowest method, but if you're going to decode the file anyway, you might as well go this route as it gives you an exact duration.
Read the whole file, skimming through frame headers.You'll have to read the whole file from disk, but you won't have to decode it. Can be slow if I/O is slow, but gives you an exact duration.
Read the first frame's bitrate and estimate duration by file size.Definitely the fastest method, and the one most commonly used by players. Duration is an estimate only, and is reasonably accurate for CBR, but can be wildly inaccurate for VBR.
What I'm getting at is that these files might not actually be broken. They might just be VBR files that your player doesn't know the duration of.
If you're convinced they are broken (such as stopping in the middle of content), then you'll have to figure out how you want to handle it. There are probably only a couple ways to determine this:
Ideally, there's an ID3 tag indicating duration, and you can decode the whole file and determine its real duration to compare.
Usually, that ID3 tag won't exist, so you'll have to check to see if the last frame is complete or not.
Beyond that, you don't really have a good way of knowing if the stream is incomplete, since there is no outer container that actually specifies number of frames to expect.
The expression for calculating the filesize of an mp3 based on duration and encoding (from this answer) is quite simple:
x = length of song in seconds
y = bitrate in kilobits per second
(x * y) / 1024 = filesize (MB)
There is also a javascript implementation for the Web Audio API in another answer on that same question. Perhaps that would be useful in your Node implementation.
mp3diags is some older open source software for fixing mp3s and which was great for batch processing stuff like this. The source is c++ and still available if you're feeling nosy and want to see how some of these features are implemented.
Worth a look since it has some features that might be be useful in your context:
What is MP3 Diags and what does it do?
low quality audio
missing VBR header
missing normalization data
Correcting files that show incorrect song duration
Correcting files in which the player cannot seek correctly

What's FFmpeg doing with avcodec_send_packet()?

I'm trying to optimise a piece of software for playing video, which internally uses the FFmpeg libraries for decoding. We've found that on some large (4K, 60fps) video, it sometimes takes longer to decode a frame than that frame should be displayed for; sadly, because of the problem domain, simply buffering/skipping frames is not an option.
However, it appears that the FFmpeg executable is able to decode the video in question fine, at about 2x speed, so I've been trying to work out what we're doing wrong.
I've written a very stripped-back decoder program for testing; the source is here (it's about 200 lines). From profiling it, it appears that the one major bottleneck during decoding is the avcodec_send_packet() function, which can take up to 50ms per call. However, measuring the same call in FFmpeg shows strange behaviour:
(these are the times taken for each call to avcodec_send_packet() in milliseconds, when decoding a 4K 25fps VP9-encoded video.)
Basically, it seems that when FFmpeg uses this function, it only really takes any amount of time to complete every N calls, where N is the number of threads being used for decoding. However, both my test decoder and the actual product use 4 threads for decoding, and this doesn't happen; when using frame-based threading, the test decoder behaves like FFmpeg using only 1 thread. This would seem to indicate that we're not using multithreading at all, but we've still seen performance improvements by using more threads.
FFmpeg's results average out to being about twice as fast overall as our decoders, so clearly we're doing something wrong. I've been reading through FFmpeg's source to try to find any clues, but it's so far eluded me.
My question is: what's FFmpeg doing here that we're not? Alternatively, how can we increase the performance of our decoder?
Any help is greatly appreciated.
I was facing the same problem. It took me quite a while to figure out a solution which I want to share here for future references:
Enable multithreading for the decoder. Per default the decoder only uses one thread, depending on the decoder, multithreading can speed up decoding drastically.
Assuming you have AVFormatContext *format_ctx, a matching codec AVCodec* codec and AVCodecContext* codec_ctx (allocated using avcodec_alloc_context3).
Before opening the codec context (using avcodec_open2) you can configure multithreading. Check the capabilites of the codec in order to decide which kind of multithreading you can use:
// set codec to automatically determine how many threads suits best for the decoding job
codec_ctx->thread_count = 0;
if (codec->capabilities | AV_CODEC_CAP_FRAME_THREADS)
codec_ctx->thread_type = FF_THREAD_FRAME;
else if (codec->capabilities | AV_CODEC_CAP_SLICE_THREADS)
codec_ctx->thread_type = FF_THREAD_SLICE;
else
codec_ctx->thread_count = 1; //don't use multithreading
Another speed-up I found out is the following: keep sending packets to the decoder (thats what avcodec_send_packet() is doing) until you get AVERROR(EAGAIN) as return value. This means the internal decoder buffers are full and you first need to collect the decoded frames (but remember to send this last packet again after the decoder is empty again). Now you can collect the decoded frames using avcodec_receive_frame until you get AVERROR(EAGAIN) again.
Some decoders work way faster when they have mutiple frames queued for decoding (thats what the decoder does when codec_ctx->thread_type = FF_THREAD_FRAME is set).
avcodec_send_packet() and avcodec_receive_frame() are wrapper functions most important thing those do is calling selected codec's decode function and returns decoded frame (or error).
Try tweaking the codec options, for example, low latency may not give you what you want. And sometimes old api (I believe it still around) avcodec_decode_video2() outperforms newer one, you may try that too.

Is MMAP what I need from ALSA to play simultaneous, immediate sounds in my game?

I'm new to ALSA and I've managed to get PCM sound played in SND_PCM_ACCESS_RW_INTERLEAVED mode. My problem is that I just can't find a way to make that mode useful for what I'm trying to do. (If someone can tell me how, I'll be glad to read). I've been reading there is this MMAP mode, but it's not as easy to find simple examples for it. I wonder if it is what I need and how I could implement it.
What I want to do is have my little game (a simple space shoot-up) to immediately play a sound when I shoot or get shot. If an enemy shoots while another sound is being played, the sounds should add up and saturate as necessary, but no sound event should be interrupted. In other words, I need to be able to edit the very byte that's about to be played.
In my useless attempts to try MMAP (without really knowing how it works in practice; just following vague theoretical instructions), I set up everything just like for SND_PCM_ACCESS_RW_INTERLEAVED, but change it to SND_PCM_ACCESS_MMAP_INTERLEAVED. Then I call snd_pcm_avail_update, which seems to work and returns a large number of available frames. After that, I call snd_pcm_mmap_begin, passing the parameters, previously filling "frames" with a reasonable number (a 10, for example). The function fails and returns an error code -77. I haven't been able to find what that means. The areas array remains unmodified.
What does that error mean? Where can I get a list of the errors? How can I overcome it? Is there a good, simple, example of how to use MMAP (or some other thing) to perform something more or less like what I'm trying to do?
I appreciate your help :)
ALSA returns negative values on error. 77 is most likely EBADFD which indicates that the device is in an invalid state (under/overrun or not running at all). In case of underrun you're probably using a too low buffersize.
In any case, there's no way to modify audio data that you've already submitted to the alsa driver (snd_pcm_mmap_commit/writei/writen). The trick to have audio sound immediately is just to use very low buffer sizes, < 10ms will do. For this you'll want to use hw: devices, other device types usually add latency.
You still have to mix sounds together manually before you pass them to alsa.
There's a nice mmap example in the comments on this question: Alsa api: how to use mmap in c?.
That being said, ALSA is a valid choice for this kind of application but you don't necessarily need to use memory mapping. Read/write access doesn't introduce additional latency, it just copies audio around a bit more.

Download last 30 seconds of an mp3

Is it possible to download only the last 30 seconds of an mp3? Or is it necessary to download the whole thing and crop it after the fact? I would be downloading via http, i.e. I have the URL of the file but that's it.
No, it is not possible... at least not without knowing some more information first.
The real problem here is determining at what byte offset the last 30 seconds is. This is a product of knowing:
Sample Rate
Bit Depth (per sample)
# of Channels
CBR or VBR
Bit Rate
Even then, you're not going to get that with a VBR MP3 file, and even with CBR, who knows how big the ID3 and other crap at the beginning of the file is. Even if you know all of that, there is still some variability, as you have the problem of the bit reservoir.
The only way to know would be to download the whole file and use a tool such as FFMPEG to find out the right offset. Then if you want to play it, you'll want to add the appropriate headers, and make sure you are trimming on an eligible frame, or fix the bit reservoir yourself.
Now, if this could all be figured out server-side ahead of time, then yes, you could request the offset from the server, and then download from there. As for how to download it, your question is very incomplete and didn't mention what protocol you were using, so I cannot help you there.

low latency sounds on key presses

I am trying to write an application(I'm a gui first timer) for my son, he has autism. There is a video player in the top half and a text entry area in the bottom. When letters are typed sounds are produced to mimic the words in the video.
There have been other posts on this site in regard to playing sounds on key presses, using gstreamer as a system call. I have also tried libcanberra but both seem to have significant delays between sounds. I can write the app in python or C but will likely do at least some of it in C.
I also want to mention that the video portion is being played by gstreamer. I tried to create two instances of gstreamer, to avoid expensive system calls but the audio instance seemed to kill the app when called.
If anyone has any tips on creating faster responding sounds I would really appreciate it.
You can upload a raw audio sample directly to PulseAudio so there will be no decoding and (perhaps save) extra switches by using the following function from Canberra:
http://developer.gnome.org/libcanberra/unstable/libcanberra-canberra.html#ca-context-cache
The next ca_context_play() will use it.
However, the biggest problem you'll encounter with this scenario (with simultaneous video playback) is that the audio device might be configured with large latency with PulseAudio (up to 1/2s or more for normal playback). It may be reasonable to file a bug to libcanberra to support a LOW_LATENCY flag, as it currently doesn't attempt to minimize delay for sound events afaik. That would be great to have.
GStreamer pulsesink could probably get low latency too (it has some properties for that), but I am afraid it won't be as lightweight as libcanberra, and you won't be able to cache a sample for instance. Ideally, GStreamer could also learn to cache samples, or pre-fill PulseAudio...

Resources