What is Web Audio API's bit depth? - audio

What is the bit depth of Web Audio API's audio context?
For example if you want to create a custom curve to use with WaveShaperNode what is the appropriate Float32Array size?
I have seen developers using 65536 which is for 16-Bit audio, but i cant find any info in the spec.

Actually, internally the system uses Float32, which has a significand of 23 bits. Using floating point allows the ability to avoid most clipping problems, while enabling good precision. This means technically there is little point in ever attempting to create a waveshaping curve larger than 8388608 (2^23) samples; but in reality, a 16-bit curve is pretty high-resolution (signal-to-noise is ~96dB). A lot of the reason for 32-bit audio processing was to avoid clipping problems, not improving SNR of input/output; use of floating point helps this dramatically. The WaveShaperNode specifically clips to [-1, +1] (most nodes don't), incidentally.
So in short - just use 16-bit (65535), but make sure your signal is in the -1,+1 range.

Related

svg viewbox resolution limits

I was wondering if there are any hard limits to the viewbox of a svg element. I see weird clipping when I reach very low values ( say when vb width is around .002 ) or very large ones firefox starts to play funny around 200000000 width.
Is there a rule, a spec somewhere where I can find the current limits ?
Fiddle here:
var dim = 0.00002;
http://jsfiddle.net/7v36sLj8/13/
You'll see funny things starting to happen from that dim onwards, you can decrease by a factor 10 or increase by a factor 10 as fitting.
Thanks for the answers, I'll just take the min/maxes for the lowest common denominator which seems to be ffox for now. ( thanks for answers, also Rob's answer explains why ffox has a much lower threshold on linux / osx).
Firefox originally used a graphics library called cairo to do cross platform graphics rendering. Cairo only allows units to have 32 bits of fixed point binary precision so Firefox chose 24 bits before the binary point and 8 bits of binary fraction. So the maximum co-ordinates are then 2^24 and the smallest deltas 1/256.
Firefox has been replacing cairo with more direct platform rendering e.g. Direct2D on Windows which is used in preference to cairo now if you have a hardware acceleration capable graphics chip and have hardware acceleration enabled. The platform libraries don't have the cairo range limitation but do seem to have their own bugs with large co-ordinates.
The spec requires that browsers support single-precision floating point numbers. Browsers are encouraged to use double-precision numbers for some calculations, mostly matrix transformations, where small decimals are often important, but the general rule is standard C++ "float" data type.
From http://www.w3.org/TR/SVG11/types.html#Precision:
4.3 Real number precision
Unless stated otherwise for a particular attribute or property, a has the capacity for at least a single-precision floating point number and has a range (at a minimum) of -3.4e+38F to +3.4e+38F.
It is recommended that higher precision floating point storage and computation be performed on operations such as coordinate system transformations to provide the best possible precision and to prevent round-off errors.
Conforming High-Quality SVG Viewers are required to use at least double-precision floating point for intermediate calculations on certain numerical operations.
How does that relate to your issue?
A value of 0.002 shouldn't be a problem at all. Numbers like 200000000 would only be a problem if you then needed fine decimals. If your viewbox was "200000000 200000000 0.002 0.002" -- in other words, a very small range of very large numbers -- then floating point precision would likely be the problem. However, if there's a problem with low-precision large numbers or with decimals that can be exactly encoded by a float, then the browser isn't living up to the specs.
It could be that the browser is trying to smooth shapes, but is rounding to the nearest user unit instead of rounding to the nearest display pixel. Can you put together a simple example that demonstrates the specific problems you're seeing?

How can a jpeg encoder become more efficient

Earlier I read about mozjpeg. A project from Mozilla to create a jpeg encoder that is more efficient, i.e. creates smaller files.
As I understand (jpeg) codecs, a jpeg encoder would need to create files that use an encoding scheme that can also be decoded by other jpeg codecs. So how is it possible to improve the codec without breaking compatibility with other codecs?
Mozilla does mention that the first step for their encoder is to add functionality that can detect the most efficient encoding scheme for a certain image, which would not break compatibility. However, they intend to add more functionality, first of which is "trellis quantization", which seems to be a highly technical algorithm to do something (I don't understand).
I'm also not entirely sure this quetion belongs on stack overflow, it might also fit superuser, since the question is not specifically about programming. So if anyone feels it should be on superuser, feel free to move this question
JPEG is somewhat unique in that it involves a series of compression steps. There are two that provide the most opportunities for reducing the size of the image.
The first is sampling. In JPEG one usually converts from RGB to YCbCR. In RGB, each component is equal in value. In YCbCr, the Y component is much more important than the Cb and Cr components. If you sample the later at 4 to 1, a 4x4 block of pixels gets reduced from 16+16+16 to 16+1+1. Just by sampling you have reduced the size of the data to be compressed by nearly 1/3.
The other is quantization. You take the sampled pixel values, divide them into 8x8 blocks and perform the Discrete Cosine transform on them. In 8bpp this takes 8x8 8-bit data and converts it to 8x8 16 bit data (inverse compression at that point).
The DCT process tends to produce larger values in the upper right corner and smaller values (close to zero) towards the lower left corner. The upper right coefficients are more valuable than the lower left coefficients.
The 16-bit values are then "quantized" (division in plain english).
The compression process defines an 8x8 quantization matrix. Divide the corresponding entry in the DCT coefficients by the value in the quantization matrix. Because this is integer division, the small values will go to zero. Long runs of zero values are combined using run-length compression. The more consecutive zeros you get, the better the compression.
Generally, the quantization values are much higher at the lower left than in the upper right. You try to force these DCT coefficients to be zero unless they are very large.
This is where much of the loss (not all of it though) comes from in JPEG.
The trade off is to get as many zeros as you can without noticeably degrading the image.
The choice of quantization matrices is the major factor in compression. Most JPEG libraries present a "quality" setting to the user. This translates into the selection of a quantization matrices in the encoder. If someone could devise better quantization matrices, you could get better compression.
This book explains the JPEG process in plain English:
http://www.amazon.com/Compressed-Image-File-Formats-JPEG/dp/0201604434/ref=sr_1_1?ie=UTF8&qid=1394252187&sr=8-1&keywords=0201604434
JPEG provides you multiple options. E.g. you can use standard Huffman tables or you can generate Huffman tables optimal for a specific image. The same goes for quantization tables. You can also switch to using arithmetic coding instead of Huffman coding for entropy encoding. The patents covering arithmetic coding as used in JPEG have expired. All of these options are lossless (no additional loss of data). One of the options used by Mozilla is instead of using baseline JPEG compression they use progressive JPEG compression. You can play with how many frequencies you have in each scan (SS, spectral selection) as well as number of bits used for each frequency (SA, successive approximation). Consecutive scans will have additional frequencies and or addition bits for each frequency. Again all of these different options are lossless. For the standard images used for JPEG switching to progressive encoding improved compression from 41 KB per image to 37 KB. But that is just for one setting of SS and SA. Given the speed of computers today you could automatically try many many different options and choose the best one.
Although hardly used the original JPEG standard had a lossless mode. There were 7 different choices for predictors. Today you would compress using each of the 7 choices and pick the best one. Use the same principle for what I outlined above. And remember non of them encounter additional loss of data. Switching between them is lossless.

FFTW for exponential frequency axis

I have a group of related questions regarding FFTW and audio analysis on Linux.
What is the easiest-to-use, most comprehensive audio library in Linux/Ubuntu that will allow me to decode any of a variety of audio formats (MP3, etc.) and acquire a buffer of raw 16-bit PCM values? gstreamer?
I intend on taking that raw buffer and feeding it to FFTW to acquire frequency-domain data (without complex information or phase information). I think I should use one of their "r2r" methods, probably the DHT. Is this correct?
It seems that FFTW's output frequency axis is discretized in linear increments that are based on the buffer length. It further seems that I can't change this discretization within FFTW so I must do it after the DHT. Instead of a linear frequency axis, I need an exponential axis that follows 2^(i/12). I think I'll have to take the DHT output and run it through some custom anti-aliasing function. Is there a Linux library to do such anti-aliasing? If not, would a basic cosine-based anti-aliasing function work?
Thanks.
This is an age old problem with FFTs and working with audio - ideally we want a log frequency scale for audio but the DFT/FFT has a linear scale. You will need to choose an FFT size that gives sufficient resolution at the low end of your frequency range, and then accumulate bins across the frequency range of interest to give yourself a pseudo-logarithmic representation. There are more complex schemes, but essentially it all boils down to the same thing.
I've seen libsndfile used all over the place:
http://www.mega-nerd.com/libsndfile/
It's LGPL too. It can read pretty much all the open source and lossless audio format you would care about. It doesn't do MP3, however, because of licensing costs.

8 bit audio samples to 16 bit

This is my "weekend" hobby problem.
I have some well-loved single-cycle waveforms from the ROMs of a classic synthesizer.
These are 8-bit samples (256 possible values).
Because they are only 8 bits, the noise floor is pretty high. This is due to quantization error. Quantization error is pretty weird. It messes up all frequencies a bit.
I'd like to take these cycles and make "clean" 16-bit versions of them. (Yes, I know people love the dirty versions, so I'll let the user interpolate between dirty and clean to whatever degree they like.)
It sounds impossible, right, because I've lost the low 8 bits forever, right? But this has been in the back of my head for a while, and I'm pretty sure I can do it.
Remember that these are single-cycle waveforms that just get repeated over and over for playback, so this is a special case. (Of course, the synth does all kinds of things to make the sound interesting, including envelopes, modulations, filters cross-fading, etc.)
For each individual byte sample, what I really know is that it's one of 256 values in the 16-bit version. (Imagine the reverse process, where the 16-bit value is truncated or rounded to 8 bits.)
My evaluation function is trying to get the minimum noise floor. I should be able to judge that with one or more FFTs.
Exhaustive testing would probably take forever, so I could take a lower-resolution first pass. Or do I just randomly push randomly chosen values around (within the known values that would keep the same 8-bit version) and do the evaluation and keep the cleaner version? Or is there something faster I can do? Am I in danger of falling into local minimums when there might be some better minimums elsewhere in the search space? I've had that happen in other similar situations.
Are there any initial guesses I can make, maybe by looking at neighboring values?
Edit: Several people have pointed out that the problem is easier if I remove the requirement that the new waveform would sample to the original. That's true. In fact, if I'm just looking for cleaner sounds, the solution is trivial.
You could put your existing 8-bit sample into the high-order byte of your new 16-bit sample, and then use the low order byte to linear interpolate some new 16 bit datapoints between each original 8-bit sample.
This would essentially connect a 16 bit straight line between each of your original 8-bit samples, using several new samples. It would sound much quieter than what you have now, which is a sudden, 8-bit jump between the two original samples.
You could also try apply some low-pass filtering.
Going with the approach in your question, I would suggest looking into hill-climbing algorithms and the like.
http://en.wikipedia.org/wiki/Hill_climbing
has more information on it and the sidebox has links to other algorithms which may be more suitable.
AI is like alchemy - we never reached the final goal, but lots of good stuff came out along the way.
Well, I would expect some FIR filtering (IIR if you really need processing cycles, but FIR can give better results without instability) to clean up the noise. You would have to play with it to get the effect you want but the basic problem is smoothing out the sharp edges in the audio created by sampling it at 8 bit resolutions. I would give a wide birth to the center frequency of the audio and do a low pass filter, and then listen to make sure I didn't make it sound "flat" with the filter I picked.
It's tough though, there is only so much you can do, the lower 8 bits is lost, the best you can do is approximate it.
It's almost impossible to get rid of noise that looks like your signal. If you start tweeking stuff in your frequency band it will take out the signal of interest.
For upsampling, since you're already using an FFT, you can add zeros to the end of the frequency domain signal and do an inverse FFT. This completely preserves the frequecy and phase information of the original signal, although it spreads the same energy over more samples. If you shift it 8bits to be a 16bit samples first, this won't be a too much of a problem. But I usually kick it up by an integer gain factor before doing the transform.
Pete
Edit:
The comments are getting a little long so I'll move some to the answer.
The peaks in the FFT output are harmonic spikes caused by the quantitization. I tend to think of them differently than the noise floor. You can dither as someone mentioned and eliminate the amplitude of the harmonic spikes and flatten out the noise floor, but you loose over all signal to noise on the flat part of your noise floor. As far as the FFT is concerned. When you interpolate using that method, it retains the same energy and spreads over more samples, this reduces the amplitude. So before doing the inverse, give your signal more energy by multipling by a gain factor.
Are the signals simple/complex sinusoids, or do they have hard edges? i.e. Triangle, square waves, etc. I'm assuming they have continuity from cycle to cycle, is that valid? If so you can also increase your FFT resolution to more precisely pinpoint frequencies by increasing the number of waveform cycles fed to your FFT. If you can precisely identify the frequencies use, assuming they are somewhat discrete, you may be able to completely recreate the intended signal.
The 16-bit to 8-bit via truncation requirement will produce results that do not match the original source. (Thus making finding an optimal answer more difficult.) Typically you would produce a fixed point waveform by attempting to "get the closest match" that means rounding to the nearest number (trunking is a floor operation). That is most likely how they were originally generated. Adding 0.5 (in this case 0.5 is 128) and then trunking the output would allow you to generate more accurate results. If that's not a worry then ok, but it definitely will have a negative effect on accuracy.
UPDATED:
Why? Because the goal of sampling a signal is to be able to as close a possible reproduce the signal. If conversion threshold is set poorly on the sampling all you're error is to one side of signal and not well distributed and centered about zero. On such systems you typically try to maximize the use the availiable dynamic range, particularly if you have low resolution such as an 8-bit ADC.
Band limited versions? If they are filtered at different frequencies, I'd suspect it was to allow you to play the same sound with out distortions when you went too far out from the other variation. Kinda like mipmapping in graphics.
I suspect the two are the same signal with different aliasing filters applied, this may be useful in reproducing the original. They should be the same base signal with different convolutions applied.
There might be a simple approach taking advantange of the periodicity of the waveforms. How about if you:
Make a 16-bit waveform where the high bytes are the waveform and the low bytes are zero - call it x[n].
Calculate the discrete Fourier transform of x[n] = X[w].
Make a signal Y[w] = (dBMag(X[w]) > Threshold) ? X[w] : 0, where dBMag(k) = 10*log10(real(k)^2 + imag(k)^2), and Threshold is maybe 40 dB, based on 8 bits being roughly 48 dB dynamic range, and allowing ~1.5 bits of noise.
Inverse transform Y[w] to get y[n], your new 16 bit waveform.
If y[n] doesn't sound nice, dither it with some very low level noise.
Notes:
A. This technique only works in the original waveforms are exactly periodic!
B. Step 5 might be replaced with setting the "0" values to random noise in Y[w] in step 3, you'd have to experiment a bit to see what works better.
This seems easier (to me at least) than an optimization approach. But truncated y[n] will probably not be equal to your original waveforms. I'm not sure how important that constraint is. I feel like this approach will generate waveforms that sound good.

Pitch recognition of musical notes on a smart phone

With limited resources such as slower CPUs, code size and RAM, how best to detect the pitch of a musical note, similar to what an electronic or software tuner would do?
Should I use:
Kiss FFT
FFTW
Discrete Wavelet Transform
autocorrelation
zero crossing analysis
octave-spaced filters
other?
In a nutshell, what I am trying to do is to recognize a single musical note, two octaves below middle-C to two octaves above, played on any (reasonable) instrument. I'd like to be within 20% of the semitone - in other words, if the user plays too flat or too sharp, I need to distinguish that. However, I will not need the accuracy required for tuning.
If you don't need that much accuracy, an FFT could be sufficient. Window the chunk of audio first so that you get well-defined peaks, then find the first significant peak.
Bin width = sampling rate / FFT size:
Fundamentals range from 20 Hz to 7 kHz, so a sampling rate of 14 kHz would be enough. The next "standard" sampling rate is 22050 Hz.
The FFT size is then determined by the precision you want. FFT output is linear in frequency, while musical tones are logarithmic in frequency, so the worst case precision will be at low frequencies. For 20% of a semitone at 20 Hz, you need a width of 1.2 Hz, which means an FFT length of 18545. The next power of two is 215 = 32768. This is 1.5 seconds of data, and takes my laptop's processor 3 ms to calculate.
This won't work with signals that have a "missing fundamental", and finding the "first significant" peak is somewhat difficult (since harmonics are often higher than the fundamental), but you can figure out a way that suits your situation.
Autocorrelation and harmonic product spectrum are better at finding the true fundamental for a wave instead of one of the harmonics, but I don't think they deal as well with inharmonicity, and most instruments like piano or guitar are inharmonic (harmonics are slightly sharp from what they should be). It really depends on your circumstances, though.
Also, you can save even more processor cycles by computing only within a specific frequency band of interest, using the Chirp-Z transform.
I've written up a few different methods in Python for comparison purposes.
If you want to do pitch recognition in realtime (and accurate to within 1/100 of a semi-tone), your only real hope is the zero-crossing approach. And it's a faint hope, sorry to say. Zero-crossing can estimate pitch from just a couple of wavelengths of data, and it can be done with a smartphone's processing power, but it's not especially accurate, as tiny errors in measuring the wavelengths result in large errors in the estimated frequency. Devices like guitar synthesizers (which deduce the pitch from a guitar string with just a couple of wavelengths) work by quantizing the measurements to notes of the scale. This may work for your purposes, but be aware that zero-crossing works great with simple waveforms, but tends to work less and less well with more complex instrument sounds.
In my application (a software synthesizer that runs on smartphones) I use recordings of single instrument notes as the raw material for wavetable synthesis, and in order to produce notes at a particular pitch, I need to know the fundamental pitch of a recording, accurate to within 1/1000 of a semi-tone (I really only need 1/100 accuracy, but I'm OCD about this). The zero-crossing approach is much too inaccurate for this, and FFT-based approaches are either way too inaccurate or way too slow (or both sometimes).
The best approach that I've found in this case is to use autocorrelation. With autocorrelation you basically guess the pitch and then measure the autocorrelation of your sample at that corresponding wavelength. By scanning through the range of plausible pitches (say A = 55 Hz thru A = 880 Hz) by semi-tones, I locate the most-correlated pitch, then do a more finely-grained scan in the neighborhood of that pitch to get a more accurate value.
The approach best for you depends entirely on what you're trying to use this for.
I'm not familiar with all the methods you mention, but what you choose should depend primarily on the nature of your input data. Are you analysing pure tones, or does your input source have multiple notes? Is speech a feature of your input? Are there any limitations on the length of time you have to sample the input? Are you able to trade off some accuracy for speed?
To some extent what you choose also depends on whether you would like to perform your calculations in time or in frequency space. Converting a time series to a frequency representation takes time, but in my experience tends to give better results.
Autocorrelation compares two signals in the time domain. A naive implementation is simple but relatively expensive to compute, as it requires pair-wise differencing between all points in the original and time-shifted signals, followed by differentiation to identify turning points in the autocorrelation function, and then selection of the minimum corresponding to the fundamental frequency. There are alternative methods. For example, Average Magnitude Differencing is a very cheap form of autocorrelation, but accuracy suffers. All autocorrelation techniques run the risk of octave errors, since peaks other than the fundamental exist in the function.
Measuring zero-crossing points is simple and straightforward, but will run into problems if you have multiple waveforms present in the signal.
In frequency-space, techniques based on FFT may be efficient enough for your purposes. One example is the harmonic product spectrum technique, which compares the power spectrum of the signal with downsampled versions at each harmonic, and identifies the pitch by multiplying the spectra together to produce a clear peak.
As ever, there is no substitute for testing and profiling several techniques, to empirically determine what will work best for your problem and constraints.
An answer like this can only scratch the surface of this topic. As well as the earlier links, here are some relevant references for further reading.
Summary of pitch detection algorithms (Wikipedia)
Pros and cons of Autocorrelation vs Harmonic Product Spectrum
A high-level overview of pitch detection methods
In my project danstuner, I took code from Audacity. It essentially took an FFT, then found the peak power by putting a cubic curve on the FFT and finding the peak of that curve. Works pretty well, although I had to guard against octave-jumping.
See Spectrum.cpp.
Zero crossing won't work because a typical sound has harmonics and zero-crossings much more than the base frequency.
Something I experimented with (as a home side project) was this:
Sample the sound with ADC at whatever sample rate you need.
Detect the levels of the short-term positive and negative peaks of the waveform (sliding window or similar). I.e. an envelope detector.
Make a square wave that goes high when the waveform goes within 90% (or so) of the positive envelope, and goes low when the waveform goes within 90% of the negative envelope. I.e. a tracking square wave with hysteresis.
Measure the frequency of that square wave with straight-forward count/time calculations, using as many samples as you need to get the required accuracy.
However I found that with inputs from my electronic keyboard, for some instrument sounds it managed to pick up 2× the base frequency (next octave). This was a side project and I never got around to implementing a solution before moving on to other things. But I thought it had promise as being much less CPU load than FFT.

Resources