Node.js buf.toString vs String.fromCharCode

I'm attempting to display the character í from 0xed (237).
String.fromCharCode yields the correct result:
String.fromCharCode(0xed); // 'í'
However, when using a Buffer:
var buf = new Buffer(1);
buf.writeUInt8(0xed,0); // <Buffer ed>
buf.toString('utf8'); // '?', same as buf.toString()
buf.toString('binary'); // 'í'
The 'binary' encoding for Buffer.toString() is deprecated, so I want to avoid it.
Second, I can also expect incoming data to be multibyte (i.e. UTF-8), e.g.:
String.fromCharCode(0x0512); // Ԓ - correct
var buf = new Buffer(2);
buf.writeUInt16LE(0x0512,0); // <Buffer 12 05>, [0x0512 & 0xff, 0x0512 >> 8]
buf.toString('utf8'); // Ԓ - correct
buf.toString('binary'); // Ô
Note that both examples are inconsistent.
So, what am I missing? What am I assuming that I shouldn't? Is String.fromCharCode magical?

It seems you might be assuming that strings and Buffers use the same bit width and encoding.
JavaScript strings are sequences of 16-bit UTF-16 code units, while Node's Buffers are sequences of 8-bit bytes.
UTF-8 is also a variable-length encoding, with code points consuming between 1 and 4 bytes. The UTF-8 encoding of í, for example, takes 2 bytes:
> new Buffer('í', 'utf8')
<Buffer c3 ad>
And, on its own, 0xed is not a valid UTF-8 sequence, thus the ? standing in for an "unknown character." It is, however, a valid UTF-16 code unit for use with String.fromCharCode().
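For illustration (a small sketch, not from the original answer, using the modern Buffer.from API): if the byte really is single-byte Latin-1 data, decode it as latin1; to go through UTF-8, both bytes of the sequence are needed.
// Two ways to recover 'í':
Buffer.from([0xed]).toString('latin1');     // 'í' (latin1 maps byte 0xed straight to U+00ED)
Buffer.from([0xc3, 0xad]).toString('utf8'); // 'í' (the complete 2-byte UTF-8 sequence)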
Also, the output you suggest for the 2nd example doesn't seem correct.
var buf = new Buffer(2);
buf.writeUInt16LE(0x0512, 0);
console.log(buf.toString('utf8')); // "\u0012\u0005"
You can detour with String.fromCharCode() to see the UTF-8 encoding.
var buf = new Buffer(String.fromCharCode(0x0512), 'utf8');
console.log(buf); // <Buffer d4 92>
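Going the other way, a buffer holding those two UTF-8 bytes decodes back to the same character:
Buffer.from([0xd4, 0x92]).toString('utf8'); // 'Ԓ' (U+0512)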

Related

Encode LINEAR16 audio to Twilio media audio/x-mulaw | NodeJS

I have been trying to stream mu-law audio back to Twilio. The requirement is that the payload must be encoded as audio/x-mulaw with a sample rate of 8000 and base64 encoded.
My input comes from @google-cloud/text-to-speech in LINEAR16 (per the Google docs).
I tried the wavefile package.
This is how I encoded the response from @google-cloud/text-to-speech:
const wav = new wavefile.WaveFile(speechResponse.audioContent)
wav.toBitDepth('8')
wav.toSampleRate(8000)
wav.toMuLaw()
Then I send the result back to Twilio via WebSocket
twilioWebsocket.send(JSON.stringify({
  event: 'media',
  media: {
    payload: wav.toBase64(),
  },
  streamSid: meta.streamSid,
}))
The problem is that we only hear random noise on the other end of the Twilio call, so it seems the encoding is not right.
Secondly, I have checked the @google-cloud/text-to-speech output audio by saving it to a file, and it was proper and clear.
Can anyone please help me with the encoding?
I also had this same problem. The error is in wav.toBase64(), as this includes the WAV header. Twilio media streams expect raw audio data, which you can get with wav.data.samples, so your code would be:
const wav = new wavefile.WaveFile(speechResponse.audioContent)
wav.toBitDepth('8')
wav.toSampleRate(8000)
wav.toMuLaw()
const payload = Buffer.from(wav.data.samples).toString('base64');
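With that fix, the WebSocket message from the question's snippet would look something like this (a sketch reusing the asker's variable names):
twilioWebsocket.send(JSON.stringify({
  event: 'media',
  media: {
    payload, // raw mu-law samples, base64 encoded, without the WAV header
  },
  streamSid: meta.streamSid,
}))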
I just had the same problem. The solution is that you need to convert the LINEAR16 audio by hand to the corresponding mu-law codec.
You can use the code from a music library.
I created a function out of this to convert a LINEAR16 byte array to mu-law:
short2ulaw(b: Buffer): Buffer {
  // LINEAR16 to 8-bit mu-law -> the output buffer is half the size.
  // Given the nature of LINEAR16, the input length should ALWAYS be even.
  const returnbuffer = Buffer.alloc(b.length / 2)

  for (let i = 0; i < b.length / 2; i++) {
    // JavaScript has no 16-bit integer type; every number is a
    // double-precision 64-bit float, so we work with plain numbers.
    let short = b.readInt16LE(i * 2)
    let sign = 0
    // Record the sign of the 16-bit sample and work with its magnitude.
    if (short < 0) {
      sign = 0x80
      short = -short
    }
    // Clip, then add the standard mu-law bias (0x84).
    short = short > 32635 ? 32635 : short
    const sample = short + 0x84
    // exp_lut is the exponent lookup table taken from the referenced library.
    const exponent = this.exp_lut[sample >> 8] & 0x7f
    const mantissa = (sample >> (exponent + 3)) & 0x0f
    let ulawbyte = ~(sign | (exponent << 4) | mantissa) & 0xff
    ulawbyte = ulawbyte == 0 ? 0x02 : ulawbyte
    returnbuffer.writeUInt8(ulawbyte, i)
  }
  return returnbuffer
}
You can now use this on raw PCM (LINEAR16). Just remember to strip the WAV header bytes at the beginning of the Google stream, since Google adds one.
You can then base64-encode the resulting buffer and send it to Twilio.
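Putting it together, a rough sketch (the header size and variable names are assumptions, not part of the original answer):
// Strip the WAV header Google prepends (commonly 44 bytes; verify for your stream),
// convert the remaining LINEAR16 samples to mu-law, then base64 encode for Twilio.
const rawPcm = speechResponse.audioContent.slice(44);
const mulaw = short2ulaw(rawPcm);
const payload = mulaw.toString('base64');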

What encoding does nodejs use for arguments in child_process.spawn and child_process.execFile?

In NodeJS, child_process.execFile and .spawn take this parameter:
args <string[]> List of string arguments.
How does NodeJS encode the strings you pass in this array?
Context: I'm writing a nodejs app which adds metadata (often including non-ascii characters) to an mp3.
I know that ffmpeg expects utf8-encoded arguments. If my nodejs app invokes child_process.execFile("ffmpeg",["-metadata","title="+myString], {encoding:"utf8"}) then how will nodejs encode myString in the arguments?
I know that id3v2 expects latin1-encoded arguments. If my nodejs app invokes child_process.execFile("id3v2",["--titl",myString], {encoding:"latin1"}) then how will nodejs encode myString in the arguments?
I see that execFile and spawn both take an "encoding" argument. But the nodejs docs say "The encoding option can be used to specify the character encoding used to decode the stdout and stderr output." The docs say nothing about the encoding of args.
Answer: NodeJS always encodes the args as UTF-8.
I wrote a simplistic C++ app which shows the raw truth of the bytes that are passed into its argv:
#include <stdio.h>

int main(int argc, char *argv[])
{
    printf("argc=%d\n", argc);
    for (int i = 0; i < argc; i++)
    {
        printf("%d:\"", i);
        for (char *c = argv[i]; *c != 0; c++)
        {
            if (*c >= 32 && *c < 127)
                printf("%c", *c);
            else
            {
                unsigned char d = *(unsigned char *)c;
                unsigned int e = d;
                printf("\\x%02X", e);
            }
        }
        printf("\"\n");
    }
    return 0;
}
Within my NodeJS app, I created some strings whose provenance I knew exactly:
const a = Buffer.from([65]).toString("utf8");
const pound = Buffer.from([0xc2, 0xa3]).toString("utf8");
const skull = Buffer.from([0xe2, 0x98, 0xa0]).toString("utf8");
const pound2 = Buffer.from([0xa3]).toString("latin1");
The argument to toString indicates how the raw bytes in the buffer should be decoded (as UTF-8, or latin1 in the last case). The result is that I have four strings whose contents I unambiguously know are correct.
(I understand that JavaScript VMs typically store their strings as UTF-16. The fact that pound and pound2 behave identically in my experiments shows that the provenance of the strings doesn't matter.)
Finally I invoked execFile with these strings:
child_process.execFileAsync("argcheck",[a,pound,pound2,skull],{encoding:"utf8"});
child_process.execFileAsync("argcheck",[a,pound,pound2,skull],{encoding:"latin1"});
In both cases, the raw bytes that nodejs passed into argv were UTF-8 encodings of the strings a,pound,pound2,skull.
So how can we pass latin1 arguments from nodejs?
The above explanation shows it's IMPOSSIBLE for nodejs to pass any latin1 character in the range 128..255 to child_process.spawn/execFile. But there's an escape hatch involving child_process.exec:
Example: the string "A £ ☠"
stored internally in JavaScript's UTF-16 as "\u0041 \u00A3 \u2620"
encoded in UTF-8 as "\x41 \xC2\xA3 \xE2\x98\xA0"
encoded in latin1 as "\x41 \xA3 ?" (the skull-and-crossbones is inexpressible in latin1)
Unicode chars 0-127 are the same as latin1, and encode into UTF-8 the same as latin1
Unicode chars 128-255 are the same as latin1, but encode into UTF-8 differently
Unicode chars 256+ don't exist in latin1.
// this would encode them as utf8, which is wrong:
execFile("id3v2", ["--comment", "A £ ☠", "x.mp3"]);
// instead we'll use shell printf to bypass nodejs's UTF-8 encoding:
exec("id3v2 --comment \"`printf 'A \\xA3 ?'`\" x.mp3");
Here's a handy way to turn a string like "A £ ☠" into one like "A \xA3 ?", ready to pass into child_process.exec:
const comment2 = [...comment]
  .map(c =>
    c <= "\u007F" ? c : c <= "\u00FF"
      ? `\\x${("000" + c.charCodeAt(0).toString(16)).substr(-2)}`
      : "?")
  .join("");
const cmd = `id3v2 --comment \"\`printf \"${comment2}\"\`\" \"${fn}\"`;
child_process.exec(cmd, (e, stdout, stderr) => { ... });
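For example, for the comment "A £ ☠" the mapping above yields the shell-ready string A \xa3 ? (lowercase hex, since toString(16) emits lowercase), which printf then expands back into latin1 bytes before id3v2 sees them:
console.log(comment2); // A \xa3 ?   (for comment = "A £ ☠")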

Buffers filled with unicode zeroes

I'm trying to synchronously read parameters from the console in Node, and I managed to do the following:
var load = function () {
  const BUFFER_LENGTH = 1024;
  const stdin = fs.openSync('/dev/stdin', 'rs');
  const buffer = Buffer.alloc(BUFFER_LENGTH);
  console.log('Provide parameter: ');
  fs.readSync(stdin, buffer, 0, BUFFER_LENGTH);
  fs.closeSync(stdin);
  return buffer.toString().replace(/\n*/, '');
}
It works, but here's a strange thing:
var loadedValue = load();
console.log(loadedValue); // displays "a" if I typed "a", so the result is correct
console.log({loadedValue}); // displays { loadedValue: 'a\n\u0000\u0000...' }
When I wrap the value in an object, the remaining buffer bytes show up in the string. Why is that? How can I get rid of them? A regexp on the string before making the object doesn't work.
Buffer.alloc(BUFFER_LENGTH) creates a buffer of a particular length (1024 in your case) and fills that buffer with NULL characters (as documented here).
Next, you read some (say 2) bytes from stdin into that buffer, which replaces the first two of those NULL characters with the characters read from stdin. The rest of the buffer still consists of NULLs.
If you don't truncate the buffer to the number of bytes read, your function converts the whole 1024-byte buffer to a string, mostly filled with NULLs. Since those aren't printable, they don't show up in the first console.log(), but they're still there.
So after reading from stdin, you should truncate the buffer to the right size:
let bytesRead = fs.readSync(stdin, buffer, 0, BUFFER_LENGTH);
buffer = buffer.slice(0, bytesRead);
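A corrected version of the whole function might look like this (a sketch that keeps the original fs usage and trims the trailing newline at the end rather than the start):
var load = function () {
  const BUFFER_LENGTH = 1024;
  const stdin = fs.openSync('/dev/stdin', 'rs');
  const buffer = Buffer.alloc(BUFFER_LENGTH);
  console.log('Provide parameter: ');
  const bytesRead = fs.readSync(stdin, buffer, 0, BUFFER_LENGTH);
  fs.closeSync(stdin);
  // Only convert the bytes that were actually read, then strip the trailing newline.
  return buffer.slice(0, bytesRead).toString().replace(/\n+$/, '');
}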

Writing binary data to Buffer

Normally, I would expect that the following would be good enough to represent binary data in a Buffer:
new Buffer('01001000','binary')
but I am pretty certain Node.js/JS does not support this 'binary' encoding.
What is the best way then to write binary data to a buffer?
You can do binary encoding like this:
var binaryString = "\xff\xfa\xc3\x4e";
var buffer = new Buffer(binaryString, "binary");
console.log(buffer);
<Buffer ff fa c3 4e>
// types of encoding allowed
encoding    size (bytes)
base64      4,177,241
binary      4,162,398
hex         4,669,965
JSON        2,271,670
utf16le*    4,543,605
utf8*       3,640,132
ascii*      2,929,850
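If the goal was to turn a string of bits like '01001000' into actual bytes, no Buffer encoding does that directly; one approach (a sketch, not from the answer above) is to parse the bit string yourself:
// Sketch: interpret '01001000' as bytes (here a single byte, 0x48).
const bits = '01001000';
const bytes = [];
for (let i = 0; i < bits.length; i += 8) {
  bytes.push(parseInt(bits.slice(i, i + 8), 2));
}
console.log(Buffer.from(bytes)); // <Buffer 48>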

Convert ieee754 to decimal in node

I have a buffer in node <Buffer 42 d9 00 00> that is supposed to represent the decimal 108.5. I am using this module to try and decode the buffer: https://github.com/feross/ieee754.
ieee754.read = function (buffer, offset, isLE, mLen, nBytes)
The arguments mean the following:
buffer = the buffer
offset = offset into the buffer
value = value to set (only for write)
isLe = is little endian?
mLen = mantissa length
nBytes = number of bytes
I try to read the value: ieee754.read(buffer, 0, false, 5832704, 4) but am not getting the expected result. I think I am calling the function correctly, although I am unsure about the mLen argument.
It turns out the Node Buffer class has this ability built in: buffer.readFloatBE(0).
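For reference, a quick check of both approaches on the buffer above; the mantissa length of a 32-bit float is 23 bits, which is presumably what mLen should have been:
const ieee754 = require('ieee754');
const buf = Buffer.from([0x42, 0xd9, 0x00, 0x00]);
console.log(buf.readFloatBE(0));                 // 108.5
console.log(ieee754.read(buf, 0, false, 23, 4)); // 108.5 (big-endian 32-bit float)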
