Changing the bit depth of audio with VHDL to use a codec

I'm trying to use the audio codec on the Xilinx Virtex 5 ML506 board, which works with samples 20 bits long. The problem is as follows:
My samples are 8 bits long, so I have tried to play them by placing them in the most significant bits of the codec input (that is, codec_input <= my_sample & "000000000000"). But as a result it plays the audio it was supposed to play (recognizably), plus significant noise.
I have read somewhere that the codec input should be filled with the sample, so I tried codec_input <= my_sample * "111111111111", but it behaved the same way.
The codec is working properly; I verified that by playing samples 20 bits long. But I need it to reproduce 8-bit ones.
So if any of you have advice or suggestions, I would thank you very much.
Cheers!
EDIT: I have tried placing the sample in the LSBs of the codec input, and it didn't work either.

So you want to:
Use the 8 bits of data you have as the MS bits
Duplicate them to the next byte down
Duplicate them again (presumably the high nibble as there are only 4 bits left to fill?)
Use the & operator to concatenate your bits together like so:
codec_input <= sample & sample & sample(7 downto 4);
I'm not sure that'll sound any better, but I think that's what you asked for.
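For comparison, here is the same bit-replication idea as a small C sketch (a hypothetical helper, not part of the codec interface):

#include <stdint.h>

/* Widen an 8-bit sample to 20 bits by repeating it down the word,
 * rather than padding with zeros. */
uint32_t widen_8_to_20(uint8_t s)
{
    return ((uint32_t)s << 12)   /* sample in the top 8 of the 20 bits */
         | ((uint32_t)s << 4)    /* duplicated to the next byte down */
         | ((uint32_t)s >> 4);   /* high nibble fills the last 4 bits */
}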

How is WAV data stored in principle?

Each WAV file depends on a sampling rate and a bit depth. The former governs how many samples are played per second, and the latter governs how many possible values there are for each timeslot.
If the sampling rate is, for example, 1000 Hz and the bit depth is 8, then every 1/1000 of a second the audio device plays one of $2^8 = 256$ possible values.
Hence the bulk of the WAV file is a sequence of 8-bit numbers. There is also a header which contains the Sampling Rate and Bit Depth and other specifics of how the data should be read:
[Hex dump omitted.] The above comes from running xxd on a WAV file to view it in hexadecimal on the terminal. The first column is just increments of 6 in hexadecimal. The last line seems to say where the header ends. So the data looks like this:
[Hex dump of the data section omitted.]
Each of those 8-bit numbers is a sample. So the device reads left to right and converts the samples in order into sounds. But how, in principle, can each number correspond to a sound? I would think each number should somehow encode an amplitude and a pitch, with each coming from a finite range. But I cannot find any reference to, for example, the first half of the bits being a pitch and the second half being an amplitude.
I have found references to the numbers encoding "signal strength", but I do not know what this means. Can anyone explain in principle how the data is read and converted to audio?
In your example, over the course of a second, 1000 values are sent to a DAC (digital-to-analog converter), where the discrete values are smoothed out into a waveform. The pitch is determined by the rate and pattern with which the stream of values rises and falls.
Steven W. Smith gives some good diagrams and explanations in the chapter "ADC and DAC" of his very helpful book The Scientist and Engineer's Guide to Digital Signal Processing.
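To make the amplitude encoding concrete, here is a small C sketch (illustrative only; note that 8-bit WAV PCM is unsigned, so silence sits at 128, not 0) that writes one second of a sine tone as raw 8-bit samples:

#include <math.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    const double pi = 3.14159265358979323846;
    const int sampleRate = 8000;  /* samples per second */
    const double freq = 440.0;    /* pitch comes from how fast the values rise and fall */
    for (int n = 0; n < sampleRate; n++) {
        double t = (double)n / sampleRate;
        /* each byte is just an amplitude: where the speaker cone should be */
        uint8_t sample = (uint8_t)(128.0 + 127.0 * sin(2.0 * pi * freq * t));
        fputc(sample, stdout);    /* raw PCM; a WAV file would add a header */
    }
    return 0;
}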

Will this change in audio codec result in appreciable difference?

I used SciPy to run a Butterworth low-pass filter, removing sounds above a certain frequency from an audio file. The SciPy package is fast and easy to use but, unfortunately, lacks options for specifying the codec used in the output.
My original audio files were in PCM s16le (16 bits per sample). The output audio files are in 64-bit float LE (64 bits per sample). Will the change in codec have an appreciable impact on the way the audio files sound? Would I be able to keep the sound quality similar if I were to convert the output audio codec back to its original format?
Yes, converting the audio back to the original format of 16-bit integers should not cause audible quality loss.
The higher-precision format can be useful as an intermediate format for processing, but converting back to the 16-bit integer format does not incur any extra audible noise.
See https://people.xiph.org/~xiphmont/demo/neil-young.html for further explanations on the matter. A few relevant quotes:
16 bits is enough to store all we can hear, and will be enough forever.
[...]
When does 24 bit matter?
Professionals use 24 bit samples in recording and production for headroom, noise floor, and convenience reasons.
16 bits is enough to span the real hearing range with room to spare. [...]
[...] Once the music is ready to distribute, there's no reason to keep more than 16 bits.
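If you do the conversion yourself rather than through a library, a minimal C sketch of the scale-and-clamp step (assuming float samples normalized to [-1.0, 1.0]) could look like this:

#include <math.h>
#include <stdint.h>

/* Convert one 64-bit float sample back to signed 16-bit PCM,
 * clamping any overshoot the filter may have introduced. */
int16_t float_to_s16(double x)
{
    double scaled = x * 32767.0;
    if (scaled >  32767.0) scaled =  32767.0;  /* clamp positive overshoot */
    if (scaled < -32768.0) scaled = -32768.0;  /* clamp negative overshoot */
    return (int16_t)lrint(scaled);             /* round to nearest */
}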

How is a 24-bit audio stream delivered to the graph?

This is probably a very silly question, but after searching for a while, I couldn't find a straight answer.
If a source filter (such as the LAV Audio codec) is processing a 24-bit integer audio stream, how are individual audio samples delivered to the graph?
(For simplicity, let's consider a monophonic stream.)
Are they stored individually in 32-bit integers with the most significant bits unused, or are they stored in packed form, with the least significant bits of the next sample occupying the spare most significant bits of the current sample?
The format is similar to 16-bit PCM: the values are signed integers, little endian.
With 24-bit audio you normally define the format with the help of the WAVEFORMATEXTENSIBLE structure, as opposed to WAVEFORMATEX (well, the latter is also possible in terms of being accepted by certain filters, but in general you are expected to use the former).
The structure has two values: number of bits per sample and number of valid bits per sample. So it's possible to have the 24-bit data represented as 24-bit values, and also as 24-bit meaningful bits of 32-bit values. The payload data should match the format.
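For illustration, a sketch of filling in that structure for mono 24-bit PCM carried in 32-bit containers (the sample rate and channel mask are just example values):

#include <windows.h>
#include <mmreg.h>
#include <ksmedia.h>  /* KSDATAFORMAT_SUBTYPE_PCM */

WAVEFORMATEXTENSIBLE make24In32Format(void)
{
    WAVEFORMATEXTENSIBLE wfx = {0};
    wfx.Format.wFormatTag      = WAVE_FORMAT_EXTENSIBLE;
    wfx.Format.nChannels       = 1;
    wfx.Format.nSamplesPerSec  = 48000;
    wfx.Format.wBitsPerSample  = 32;  /* container size, a multiple of 8 */
    wfx.Format.nBlockAlign     = wfx.Format.nChannels * wfx.Format.wBitsPerSample / 8;
    wfx.Format.nAvgBytesPerSec = wfx.Format.nSamplesPerSec * wfx.Format.nBlockAlign;
    wfx.Format.cbSize          = sizeof(WAVEFORMATEXTENSIBLE) - sizeof(WAVEFORMATEX);
    wfx.Samples.wValidBitsPerSample = 24;  /* meaningful bits per sample */
    wfx.dwChannelMask = SPEAKER_FRONT_CENTER;
    wfx.SubFormat     = KSDATAFORMAT_SUBTYPE_PCM;
    return wfx;
}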
There is no mix of bits of different samples within a byte:
However, wBitsPerSample is the container size and must be a multiple of 8, whereas wValidBitsPerSample can be any value not exceeding the container size. For example, if the format uses 20-bit samples, wBitsPerSample must be at least 24, but wValidBitsPerSample is 20.
To my best knowledge it's typical to have just 24-bit values, that is three bytes per PCM sample.
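For example, unpacking one such packed little-endian sample into a 32-bit integer might look like this (a hypothetical helper):

#include <stdint.h>

int32_t unpackS24LE(const uint8_t *p)
{
    int32_t v = (int32_t)p[0] | ((int32_t)p[1] << 8) | ((int32_t)p[2] << 16);
    if (v & 0x800000)              /* is the 24-bit sign bit set? */
        v |= (int32_t)0xFF000000;  /* extend the sign into the top byte */
    return v;
}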
Non-PCM formats might define different packing and use the "unused" bits more efficiently, so that, for example, two samples of 20-bit audio consume 5 bytes.

Does "16bit integer PCM data" mean it's signed or unsigned?

I'm using FMOD to develop an application which, when the user clicks the Next/Prev button, immediately starts playing the next/previous sentence exactly from its beginning in an MP3 file that contains speech without music. I got the PCM data of an MP3 file by calling Sound::lock, but Sound::getFormat only told me it was "16bit integer PCM data", without saying whether it was signed or unsigned. How can I tell?
Some articles on the Internet say that almost all 16-bit integer PCM data is signed. If my PCM data is signed, what range of values represents silence: the values close to 0 (e.g. -10 to 10), or the values close to -32768 (e.g. -32768 to -32750)? If it is the values close to 0, does this mean that there's no difference in meaning between opposite numbers like -32767 and 32767?
I need to detect silences which are long enough, e.g. longer than 500ms, to determine where each sentence in the speech begins.
Could anyone give me any suggestions on how to detect silence between sentences?
16-bit audio is, by convention, usually signed.
Think about what PCM audio is: each sample measures how far along its axis the speaker should physically rest at that moment in time. Therefore perfect silence is absolutely any repeating value, since that represents the speaker not moving.
0 is then the centre of the range, and usually where a microphone should sit with no input. -32768 is the speaker as close to one end of its axis as it can be; 32767 is it at the other end.
The safest way to detect silence would be to run a spectral analysis over the relevant range and look for periods where there is no activity in any audible frequency range.
If you're looking for pauses between speech, then the easiest thing would probably be to go to somewhere like this, plug in an acceptable frequency range for speech (it's considered to be around 300 Hz to around 3500 Hz in telephony), your sampling rate, and however many multiplications you think you can afford, then copy the coefficients supplied. E.g. I assumed you'll do 37 taps across the speech range with a 44100 Hz input and, converted to a C array, I got:
double coefficients[] = {
-0.000560, -0.001290, -0.002332, -0.003606, -0.004911, -0.005921, -0.006201,
-0.005256, -0.002610, 0.002106, 0.009059, 0.018139, 0.028924, 0.040691, 0.052479,
0.063203, 0.071794, 0.077351, 0.079274, 0.077351, 0.071794, 0.063203, 0.052479,
0.040691, 0.028924, 0.018139, 0.009059, 0.002106, -0.002610, -0.005256, -0.006201,
-0.005921, -0.004911, -0.003606, -0.002332, -0.001290, -0.000560};
If it were double input, for each input sample c I'd then compute a sampled value:
const size_t numberOfTaps = sizeof(coefficients) / sizeof(coefficients[0]); /* 37 with the array above */
double *inputWave = ... input, an infinite array for the purposes of the example ...

double sampledValue = 0.0;
for (size_t coeff = 0; coeff < numberOfTaps; coeff++) {
    sampledValue += coefficients[coeff] * inputWave[c + coeff];
}
What I've then got is a bandpass filter. Only that part of the signal representing sound in the frequency range 300–3500Hz should remain in the output values. In real life no such filter is perfect; increase the number of coefficients to increase the quality of your filter.
Having cut irrelevant parts of the signal I could then look for prolonged periods of sampledValue = [close to] 0.0.
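Having band-passed the signal, a sketch of the silence check itself (thresholds and window sizes are up to you; 500 ms at 44100 Hz is 22050 samples):

#include <math.h>
#include <stddef.h>

/* Returns 1 if the RMS level of a window of filtered samples falls
 * below the given threshold, i.e. the window is effectively silent. */
int isSilent(const double *filtered, size_t windowLength, double threshold)
{
    double sumSquares = 0.0;
    for (size_t i = 0; i < windowLength; i++)
        sumSquares += filtered[i] * filtered[i];
    return sqrt(sumSquares / (double)windowLength) < threshold;
}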
Surprisingly, when I create DirectSound sound buffers in 8-bit format, DirectSound expects the samples to be 8-bit SIGNED (-127 to 127) on my machine, while when I create a 16-bit buffer, DirectSound expects them to be 16-bit UNSIGNED (0 to 65535). So at least on my machine the standard seems to be the opposite of Tommy's answer.

Mixing PCM audio samples

I have a simple question regarding mixing multiple PCM samples.
I read that the best way to mix multiple audio PCM samples is to take the average of the samples each frame.
So if I am adding together, say, five 16-bit samples before dividing by 5, there is obviously a good chance the sum will exceed what a 16-bit short can hold.
So when mixing multiple 16-bit samples together, do I store them all in ints first, add them, average them, and then convert back to shorts?
If you want to mix audio samples you just add them together. Building an average is not the correct way to do this.
Think about it: If someone plays a violin and a second violin joins the music, will the first violin become less loud? No. It would not. The second violin just adds to the signal.
When adding PCM samples you have to deal with integer overflows. One way to do it is to have a global 'master volume' that gets applied to the mixed PCM sample. Using such a global multiplier can help you to make sure your final signal is mostly within the 16 bits of your output data.
You'll probably also want a per channel volume control.
In the end, overflows will still occur here and there, and the best way to deal with them is to clamp the output value to the maximum and minimum representable values of your 16-bit output stream. The ear will tolerate that, and it will go unnoticed as long as it doesn't happen too often.
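A sketch of that approach (the Q8 fixed-point master volume is just one way to apply a global gain):

#include <stdint.h>

/* Mix one frame: sum the samples in a wider integer, apply a global
 * master volume (256 == 1.0 in Q8 fixed point), then clamp to 16 bits. */
int16_t mixFrame(const int16_t *samples, int count, int32_t masterVolume)
{
    int32_t acc = 0;
    for (int i = 0; i < count; i++)
        acc += samples[i];             /* plain addition, no averaging */
    acc = (acc * masterVolume) >> 8;   /* global gain */
    if (acc > INT16_MAX) acc = INT16_MAX;  /* clamp overflow */
    if (acc < INT16_MIN) acc = INT16_MIN;
    return (int16_t)acc;
}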
If we are talking about mixing, I would suggest using floats.
Anyway, if you want to use shorts, you can use 32- or 64-bit integers, or you can simply divide each sample first and add the results afterwards. That is possible since

$$\frac{a_1 + a_2 + \cdots + a_n}{n} = \frac{a_1}{n} + \frac{a_2}{n} + \cdots + \frac{a_n}{n}$$
