NodeJS: How to write a file parser using readStream?

I have a file in a binary format:
The format is as follows:
[4 - header bytes] [8 bytes - int64 - how many bytes to read following] [variable num of bytes (size of the int64) - read the actual information]
And then it repeats, so I must first read the first 12 bytes to determine how many more bytes I need to read.
I have tried:
var readStream = fs.createReadStream('/path/to/file.bin');
readStream.on('data', function(chunk) { ... })
The problem I have is that chunk always comes back in chunks of 65536 bytes at a time whereas I need to be more specific on the number of bytes that I am reading.
I have also tried readStream.on('readable', function() { readStream.read(4) })
But that is not very flexible either: it seems to turn asynchronous code into synchronous code, because I have to put the reading in a while loop.
Or maybe readStream is not appropriate in this case and I should use this instead? fs.read(fd, buffer, offset, length, position, callback)

Here's what I'd recommend as an abstract handler of a readStream to process abstract data like you're describing:
var pending = Buffer.alloc(9999999);
var cursor = 0;
stream.on('data', function(d) {
d.copy(pending, cursor);
cursor += d.length;
var test = attemptToParse(pending.slice(0, cursor));
while (test !== false) {
// test is a valid blob of data
processTheThing(test);
var rawSize = test.raw.length; // How many bytes of data did the blob actually take up?
pending.copy(pending, 0, rawSize, cursor); // Copy the data after the valid blob to the beginning of the pending buffer
cursor -= rawSize;
test = attemptToParse(pending.slice(0, cursor)); // Is there more than one valid blob of data in this chunk? Keep processing if so
}
});
For your use case, ensure the initial size of the pending Buffer is large enough to hold the largest valid blob of data you expect to parse (the header and length bytes plus the payload; an int64 length can describe far more than a Buffer can hold, so pick a realistic upper bound), plus an extra 65536 bytes in case the blob boundary falls right at the edge of a stream chunk.
My method requires an attemptToParse() function that takes a buffer and tries to parse the data out of it. It should return false if the buffer is too short (not enough data has come in yet). If the buffer holds a valid object, it should return a parsed object that exposes the raw bytes it took up (the .raw property in my example). Then you do any processing you need with the blob (processTheThing()), trim that blob out, shift the pending Buffer so it holds just the remainder, and keep going. That way, you don't have a constantly growing pending buffer or some array of "finished" blobs. Maybe the code on the receiving end of processTheThing() keeps an array of the blobs in memory, maybe it writes them to a database, but in this example that is abstracted away so this code just deals with how to handle the stream data.
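For the format in the question, attemptToParse() might look like the sketch below. It assumes the 8-byte length is an unsigned big-endian integer (the question does not say) and that payloads are small enough to fit in a Buffer. The constant names and the header and payload fields on the returned object are made up for illustration.

```javascript
// Sketch of attemptToParse() for records laid out as
// [4 header bytes][8-byte length][payload].
// Assumption: the length is unsigned and big-endian; use readBigUInt64LE otherwise.
const HEADER_SIZE = 4;
const LENGTH_SIZE = 8;
const PREFIX_SIZE = HEADER_SIZE + LENGTH_SIZE;

function attemptToParse(buf) {
  // Not even the 12-byte prefix has arrived yet.
  if (buf.length < PREFIX_SIZE) return false;

  const payloadLength = buf.readBigUInt64BE(HEADER_SIZE);
  if (payloadLength > BigInt(Number.MAX_SAFE_INTEGER)) {
    throw new RangeError('Record length ' + payloadLength + ' is too large to buffer');
  }

  const totalSize = PREFIX_SIZE + Number(payloadLength);
  // The payload is still incomplete; wait for more chunks.
  if (buf.length < totalSize) return false;

  return {
    header: buf.subarray(0, HEADER_SIZE),
    payload: buf.subarray(PREFIX_SIZE, totalSize),
    raw: buf.subarray(0, totalSize) // raw.length tells the caller how much to discard
  };
}
```

The returned fields are views into the buffer that was passed in, not copies. Since the loop above overwrites the pending buffer after processTheThing() returns, copy anything you need to keep (for example with Buffer.from(parsed.payload)).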

Add each chunk to a Buffer and then parse the data from there, taking care not to read beyond the end of the buffer (if your data is large). I'm using my tablet right now so I can't add any example source code. Maybe somebody else can?
Ok, mini source, very skeletal.
var chunks = [];
var bytesRead = 0;
stream.on('data', function(chunk) {
  chunks.push(chunk);
  bytesRead += chunk.length;
  // look at bytesRead...
  var buffer = Buffer.concat(chunks);
  chunks = [buffer]; // trick for next event
  // --> or, if memory is an issue, remove completed data from the beginning of chunks
  // work with the buffer here...
});
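Fleshing out the skeleton a little, one way to "work with the buffer" is to peel complete records off the front and keep only the remainder for the next event. The sketch below does that for the length-prefixed format from the question. createRecordSplitter, feed and onRecord are hypothetical names, the big-endian 8-byte length is an assumption, and there is no upper bound on how much data is buffered.

```javascript
// Sketch: accumulate chunks and hand complete records to a callback.
// Assumed layout: [4 header bytes][8-byte big-endian length][payload].
const PREFIX_SIZE = 12;

function createRecordSplitter(onRecord) {
  let pending = Buffer.alloc(0);

  // Call the returned function from stream.on('data', ...).
  return function feed(chunk) {
    pending = pending.length === 0 ? chunk : Buffer.concat([pending, chunk]);

    while (pending.length >= PREFIX_SIZE) {
      const payloadLength = Number(pending.readBigUInt64BE(4));
      const totalSize = PREFIX_SIZE + payloadLength;
      if (pending.length < totalSize) break; // record incomplete, wait for more data

      onRecord(pending.subarray(0, 4), pending.subarray(PREFIX_SIZE, totalSize));
      pending = pending.subarray(totalSize); // drop the completed record
    }
  };
}
```

It could be wired up as stream.on('data', createRecordSplitter(function (header, payload) { ... })). The callback runs once per complete record regardless of how the stream happened to split the bytes into chunks.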

Related

IORedis: how to publish ArrayBuffer

I'm trying to publish an ArrayBuffer to an IORedis stream.
I do so as follows:
const ab = new ArrayBuffer(1); // ArrayBuffer of length = 1 byte
const dv = new DataView(ab);
dv.setInt8(0, 7); // Write the number 7 in the buffer
const buffer = Buffer.from(ab); // Convert to Buffer since that's what `publish` expects
redisPublisher.publish('buffer-test', buffer);
It's a toy example, in practice I'll want to encode complex stuff in the ArrayBuffer, not just a number. Anyway, then I try to read with
redisSubscriber.on('message', async (channel, data) => {
logger.info(`Redis message: channel: ${channel}, data: ${data}, ${typeof data}`);
// ... do something with it
});
The problem is that data is empty, and its type is considered as string. As per the documentation I tried redisSubscriber.on('messageBuffer', ... instead, but it behaves exactly the same, so much so that I'm failing to understand the difference between the two.
Also confusing is that if I encode a Buffer, e.g.
const buffer = Buffer.from("I'm a string!", 'utf-8');
redisPublisher.publish('buffer-test', buffer);
Upon reception, data will again be a string, decoded from the Buffer, which in that toy case is ok but generally is not for me. I'd like to send a Buffer in, containing more complex data than just a string (an ArrayBuffer in my case), and get a Buffer out that I can parse based on my needs, rather than have it automatically read as a string.
Any help is welcome!
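One thing worth checking independently of Redis: interpolating a Buffer into a template literal calls toString() on it, so a Buffer holding only the byte 7 logs as an invisible control character and can look empty. The sketch below is not IORedis code. It only shows, assuming the handler is given a Buffer, how to inspect it as hex and read it back through a DataView. decodeInt8 is a hypothetical helper.

```javascript
// Build the same 1-byte payload as in the question.
const ab = new ArrayBuffer(1);
new DataView(ab).setInt8(0, 7);
const payload = Buffer.from(ab); // a Buffer view over the same memory as ab

// Hypothetical helper: read the first byte back out of a received Buffer.
function decodeInt8(data) {
  // A Buffer may be a view into a larger ArrayBuffer, so pass byteOffset and byteLength.
  const view = new DataView(data.buffer, data.byteOffset, data.byteLength);
  return view.getInt8(0);
}

const interpolated = `${payload}`;         // implicit toString(): a single control character
console.log(JSON.stringify(interpolated)); // makes the invisible character visible
console.log(payload.toString('hex'));      // the raw byte as hex
console.log(decodeInt8(payload));          // the original number
```

Logging typeof data and data.toString('hex') in the handler, instead of interpolating data directly, makes it easier to tell whether a string or a Buffer actually arrived.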

Reading data a block at a time, synchronously

What is the nodejs (typescript) equivalent of the following Python snippet? I've put an attempt at corresponding nodejs below the Python.
Note that I want to read a chunk at a time (later that is, in this example I'm just reading the first kilobyte), synchronously.
Also, I do not want to read the entire file into virtual memory at once; some of my input files will (eventually) be too big for that.
The nodejs snippet always returns null. I want it to return a string or buffer or something along those lines. If the file is >= 1024 bytes long, I want a 1024 character long return, otherwise I want the entire file.
I googled about this for an hour or two, but all I found was things synchronously reading an entire file at a time, or reading pieces at a time asynchronously.
Thanks!
Here's the Python:
def readPrefix(filename: str) -> str:
with open(filename, 'rb') as infile:
data = infile.read(1024)
return data
Here's the nodejs attempt:
const readPrefix = (filename: string): string => {
const readStream = fs.createReadStream(filename, { highWaterMark: 1024 });
const data = readStream.read(1024);
readStream.close();
return data;
};
To read synchronously, you would use fs.openSync(), fs.readSync() and fs.closeSync().
Here's some regular JavaScript code (hopefully you can translate it to TypeScript) that synchronously reads a certain number of bytes from a file and returns a buffer object containing those bytes (or throws an exception in case of error):
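const fs = require('fs');
function readBytesSync(filePath, filePosition, numBytesToRead) {
  const buf = Buffer.alloc(numBytesToRead, 0);
  let fd;
  let bytesRead = 0;
  try {
    fd = fs.openSync(filePath, "r");
    bytesRead = fs.readSync(fd, buf, 0, numBytesToRead, filePosition);
  } finally {
    if (fd !== undefined) {
      fs.closeSync(fd);
    }
  }
  // Return only the bytes actually read (fewer than requested near the end of the file)
  return buf.subarray(0, bytesRead);
}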
For your application, you can just pass 1024 as the number of bytes to read; if there are fewer than that in the file, it will just read up to the end of the file. The returned buffer object will contain the bytes read, which you can access as binary or convert to a string.
For the benefit of others reading this, I mentioned in earlier comments that synchronous I/O should never be used in a server environment (servers should always use asynchronous I/O except at startup time). Synchronous I/O can be used for stand-alone scripts that only do one thing (like build scripts, as an example) and don't need to be responsive to multiple incoming requests.
Do I need to loop on readSync() in case of EINTR or something?
Not that I'm aware of.
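To extend this to the "chunk at a time" case mentioned in the question, one option is to keep the file descriptor open and call fs.readSync() repeatedly. The generator below is a sketch of that idea; readChunksSync is a made-up name. Passing null as the position makes each call continue from the current file position.

```javascript
const fs = require('fs');

// Sketch: lazily yield a file's contents in blocks of at most chunkSize bytes.
function* readChunksSync(filePath, chunkSize) {
  const fd = fs.openSync(filePath, 'r');
  try {
    const buf = Buffer.alloc(chunkSize);
    let bytesRead;
    // readSync returns 0 at end of file.
    while ((bytesRead = fs.readSync(fd, buf, 0, chunkSize, null)) > 0) {
      // Copy the bytes out, because buf is reused for the next read.
      yield Buffer.from(buf.subarray(0, bytesRead));
    }
  } finally {
    // Runs when iteration finishes or when a for...of loop exits early.
    fs.closeSync(fd);
  }
}
```

A caller would write for (const chunk of readChunksSync(filename, 1024)) { ... }, so only one block is held in memory at a time. The same caveat about synchronous I/O in servers applies.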

Nodejs Readable streams, parsing binary data, preserving order

Using latest nodejs...
Got a binary coming from mongodb (field within a document). Means I will be processing multiple binary payloads concurrently. Data is a media file (h264) made up of slices (nal units). Each slice is delimited.
Using a readable stream from fs, if I act on "data" events, is the order of the data chunks preserved? Can I be guaranteed to process the "data" in order? (See the origin in the path part of the "this" scope in each call)
Data is guaranteed to be read from a stream in the same order in which it was written. When writing to a stream, the data is either written immediately or queued; the order does not change. This is from the Node.js source:
function writeOrBuffer(stream, state, chunk, encoding, cb) {
chunk = decodeChunk(state, chunk, encoding);
if (util.isBuffer(chunk))
encoding = 'buffer';
var len = state.objectMode ? 1 : chunk.length;
state.length += len;
var ret = state.length < state.highWaterMark;
state.needDrain = !ret;
if (state.writing || state.corked)
state.buffer.push(new WriteReq(chunk, encoding, cb));
else
doWrite(stream, state, false, len, chunk, encoding, cb);
return ret;
}
This is also how data events are fired:
// if we want the data now, just emit it.
if (state.flowing && state.length === 0 && !state.sync) {
stream.emit('data', chunk);
stream.read(0);
}
The data event won't fire for a chunk unless there is no queued data, which means you will get the data in the order in which it was passed in.

Is http.ServerResponse.write() blocking?

Is it possible to make response.write non-blocking? I've written a simple test to see if other clients can connect while one downloads a file:
var connect = require('connect');
var longString = 'a';
for (var i = 0; i < 29; i++) { // 512 MiB
longString += longString;
}
console.log(longString.length)
function download(request, response) {
response.setHeader("Content-Length", longString.length);
response.setHeader("Content-Type", "application/force-download");
response.setHeader("Content-Disposition", 'attachment; filename="file"');
response.write(longString);
response.end();
}
var app = connect().use(download);
connect.createServer(app).listen(80);
And it seems like write is blocking!
Am I doing something wrong?
Update: So it both blocks and doesn't block. It doesn't block in the sense that two files can be downloaded simultaneously, but it does block in the sense that creating the buffer is a long operation.
Any processing done strictly in JavaScript will block. response.write(), at least as of v0.8, is no exception to this:
The first time response.write() is called, it will send the buffered header information and the first body to the client. The second time response.write() is called, Node assumes you're going to be streaming data, and sends that separately. That is, the response is buffered up to the first chunk of body.
Returns true if the entire data was flushed successfully to the kernel buffer. Returns false if all or part of the data was queued in user memory. 'drain' will be emitted when the buffer is again free.
What may save some time is to convert longString to a Buffer before attempting to write() it, since the conversion will occur anyway:
var longString = 'a';
for (...) { ... }
longString = Buffer.from(longString);
But, it would probably be better to stream the various chunks of longString rather than all-at-once (Note: Streams are changing in v0.10):
var longString = 'a',
chunkCount = Math.pow(2, 29),
bufferSize = Buffer.byteLength(longString),
longBuffer = Buffer.from(longString);
function download(request, response) {
var current = 0;
response.setHeader("Content-Length", bufferSize * chunkCount);
response.setHeader("Content-Type", "application/force-download");
response.setHeader("Content-Disposition", 'attachment; filename="file"');
function writeChunk() {
if (current < chunkCount) {
current++;
if (response.write(longBuffer)) {
process.nextTick(writeChunk);
} else {
response.once('drain', writeChunk);
}
} else {
response.end();
}
}
writeChunk();
}
And, if the eventual goal is to stream a file from disk, this can be even easier with fs.createReadStream() and stream.pipe():
function download(request, response) {
// response.setHeader(...)
// ...
fs.createReadStream('./file-on-disk').pipe(response);
}
Nope, it does not block. I tried one download from IE and another from Firefox. I started IE first but could still download the file from Firefox first.
I tried with 1 MB (i < 20); it works the same, just faster.
You should know that whatever longString you create requires memory allocation. Try it with i < 30 (on Windows 7) and it will throw FATAL ERROR: JS Allocation failed - process out of memory.
The time goes into memory allocation and copying, nothing else. Since it is a huge file, the response takes a long time and your download looks like it is blocking. Try it yourself with smaller values (i < 20 or so).

Nodejs: Set highWaterMark of socket object

Is it possible to set the highWaterMark of a socket object after it has been created?
var http = require('http');
var server = http.createServer();
server.on('upgrade', function(req, socket, head) {
socket.on('data', function(chunk) {
var frame = new WebSocketFrame(chunk);
// skip invalid frames
if (!frame.isValid()) return;
// if the length in the head is unequal to the chunk
// node has maybe split it
if (chunk.length != WebSocketFrame.getLength()) {
socket.once('data', listenOnMissingChunks);
}
});
});
function listenOnMissingChunks(chunk, frame) {
frame.addChunkToPayload(chunk);
if (WebSocketFrame.getLength()) {
// if still corrupted listen once more
} else {
// else proceed
}
}
The above code example does not work. But how do I do it instead?
Further explanation:
When I receive big WebSocket frames, they get split into multiple data events. This makes it hard to parse the frames because I do not know whether a frame is split or corrupted.
I think you misunderstand the nature of a TCP socket. Despite the fact that TCP sends its data over IP packets, TCP is not a packet protocol. A TCP socket is simply a stream of data. Thus, it is incorrect to view the data event as a logical message. In other words, one socket.write on one end does not equate to a single data event on the other.
There are many reasons that a single write to a socket does not map 1:1 to a single data event:
The sender's network stack may combine multiple small writes into a single IP packet. (The Nagle algorithm)
An IP packet may be fragmented (split into multiple packets) along its journey if its size exceeds any one hop's MTU.
The receiver's network stack may combine multiple packets into a single data event (as seen by your application).
Because of this, a single data event might contain multiple messages, a single message, or only part of a message.
In order to correctly handle messages sent over a stream, you must buffer incoming data until you have a complete message.
var net = require('net');
var max = 1024 * 1024 // 1 MB, the maximum amount of data that we will buffer (prevent a bad server from crashing us by filling up RAM)
  , allocate = 4096 // how much memory to allocate at once, 4 kB (there's no point in wasting 1 MB of RAM to buffer a few bytes)
  , buffer = Buffer.alloc(allocate) // create a new buffer that allocates 4 kB to start
  , nread = 0 // how many bytes we've buffered so far
  , nproc = 0 // how many bytes in the buffer we've processed (to avoid looping over the entire buffer every time data is received)
  , client = net.connect({host:'example.com', port: 8124}); // connect to the server
client.on('data', function(chunk) {
if (nread + chunk.length > buffer.length) { // if the buffer is too small to hold the data
var need = Math.max(chunk.length, allocate); // grow by at least 4 kB, and by enough to fit this chunk
if (nread + need > max) throw new Error('Buffer overflow'); // uh-oh, we're all full - TODO you'll want to handle this more gracefully
var newbuf = Buffer.alloc(buffer.length + need); // because Buffers can't be resized, we must allocate a new one
buffer.copy(newbuf); // and copy the old one's data to the new one
buffer = newbuf; // the old, small buffer will be garbage collected
}
chunk.copy(buffer, nread); // copy the received chunk of data into the buffer
nread += chunk.length; // add this chunk's length to the total number of bytes buffered
pump(); // look at the buffer to see if we've received enough data to act
});
client.on('end', function() {
// handle disconnect
});
client.on('error', function(err) {
// handle errors
});
function find(byte) { // look for a specific byte in the buffer
for (var i = nproc; i < nread; i++) { // look through the buffer, starting from where we left off last time
if (buffer[i] == byte) { // we've found one
return i;
}
}
}
function slice(bytes) { // discard bytes from the beginning of a buffer
buffer = buffer.slice(bytes); // slice off the bytes
nread -= bytes; // note that we've removed bytes
nproc = 0; // and reset the processed bytes counter
}
function pump() {
var pos; // position of a NULL character
while ((pos = find(0x00)) >= 0) { // keep going while there's a NULL (0x00) somewhere in the buffer
if (pos == 0) { // if there's more than one NULL in a row, the buffer will now start with a NULL
slice(1); // discard it
continue; // so that the next iteration will start with data
}
process(buffer.slice(0,pos)); // hand off the message
slice(pos+1); // and slice the processed data off the buffer
}
}
function process(msg) { // here's where we do something with a message
if (msg.length > 0) { // ignore empty messages
// here's where you have to decide what to do with the data you've received
// experiment with the protocol
}
}
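The same buffering idea can be written more compactly with Buffer.concat() and Buffer.indexOf(). This sketch assumes, like the example above, that messages are terminated by a NUL byte. createMessageSplitter, feed and onMessage are hypothetical names, and the size cap plays the role of the max variable above.

```javascript
// Sketch: split a byte stream into NUL-terminated messages.
const DELIMITER = 0x00;
const MAX_PENDING = 1024 * 1024; // refuse to buffer more than 1 MB of an unfinished message

function createMessageSplitter(onMessage) {
  let pending = Buffer.alloc(0);

  // Call the returned function from socket.on('data', ...).
  return function feed(chunk) {
    pending = Buffer.concat([pending, chunk]);

    let end;
    while ((end = pending.indexOf(DELIMITER)) !== -1) {
      const message = pending.subarray(0, end);
      pending = pending.subarray(end + 1); // skip past the delimiter
      if (message.length > 0) onMessage(message); // ignore empty messages
    }

    if (pending.length > MAX_PENDING) {
      throw new Error('Message exceeds ' + MAX_PENDING + ' bytes');
    }
  };
}
```

Whether a data event carries part of a message, exactly one, or several, onMessage sees each complete message once. For WebSocket frames the loop condition would check the length from the frame header instead of searching for a delimiter.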
You don't need to. Incoming data will almost certainly be split across two or more reads: this is the nature of TCP and there is nothing you can do about it. Fiddling with obscure socket parameters certainly won't change it. And the data will be split but certainly not corrupted. Just treat the socket as what it is: a byte stream.

Resources