Reading data a block at a time, synchronously - node.js

What is the nodejs (typescript) equivalent of the following Python snippet? I've put an attempt at corresponding nodejs below the Python.
Note that I want to read a chunk at a time (later, that is; in this example I'm just reading the first kilobyte), synchronously.
Also, I do not want to read the entire file into virtual memory at once; some of my input files will (eventually) be too big for that.
The nodejs snippet always returns null. I want it to return a string or buffer or something along those lines. If the file is >= 1024 bytes long, I want a 1024-character-long return value; otherwise I want the entire file.
I googled about this for an hour or two, but all I found were examples that synchronously read an entire file at once, or read pieces at a time asynchronously.
Thanks!
Here's the Python:
def readPrefix(filename: str) -> str:
    with open(filename, 'rb') as infile:
        data = infile.read(1024)
    return data
Here's the nodejs attempt:
const readPrefix = (filename: string): string => {
    const readStream = fs.createReadStream(filename, { highWaterMark: 1024 });
    const data = readStream.read(1024);
    readStream.close();
    return data;
};

To read synchronously, you would use fs.openSync(), fs.readSync() and fs.closeSync().
Here's some regular Javascript code (hopefully you can translate it to TypeScript) that synchronously reads a certain number of bytes from a file and returns a buffer object containing those bytes (or throws an exception in case of error):
const fs = require('fs');

function readBytesSync(filePath, filePosition, numBytesToRead) {
    const buf = Buffer.alloc(numBytesToRead, 0);
    let fd;
    let bytesRead = 0;
    try {
        fd = fs.openSync(filePath, "r");
        bytesRead = fs.readSync(fd, buf, 0, numBytesToRead, filePosition);
    } finally {
        if (fd !== undefined) {
            fs.closeSync(fd);
        }
    }
    // Return only the bytes actually read (a shorter file yields a shorter buffer).
    return buf.slice(0, bytesRead);
}
For your application, you can just pass 1024 as the number of bytes to read; if the file contains fewer bytes than that, it will just read up to the end of the file. The returned buffer object will contain the bytes read, which you can access as binary or convert to a string.
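For reference, a direct TypeScript translation of the above might look something like this (an untested sketch; it assumes @types/node is available and returns a Buffer rather than a string):
import * as fs from 'fs';

function readPrefix(filename: string): Buffer {
    const buf = Buffer.alloc(1024, 0);
    let fd: number | undefined;
    let bytesRead = 0;
    try {
        fd = fs.openSync(filename, 'r');
        bytesRead = fs.readSync(fd, buf, 0, 1024, 0);
    } finally {
        if (fd !== undefined) {
            fs.closeSync(fd);
        }
    }
    // Files shorter than 1024 bytes yield a shorter buffer; call .toString() if you need a string.
    return buf.slice(0, bytesRead);
}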
For the benefit of others reading this, I mentioned in earlier comments that synchronous I/O should never be used in a server environment (servers should always use asynchronous I/O except at startup time). Synchronous I/O can be used for stand-alone scripts that only do one thing (like build scripts, as an example) and don't need to be responsive to multiple incoming requests.
Do I need to loop on readSync() in case of EINTR or something?
Not that I'm aware of.

Related

How can I limit the size of WriteStream buffer in NodeJS?

I'm using a WriteStream in NodeJS to write several GB of data, and I've identified the write loop as eating up ~2GB of virtual memory during runtime (which is then GC'd about 30 seconds after the loop finishes). I'm wondering how I can limit the size of the buffer Node is using when writing the stream so that Node doesn't use up so much memory during that part of the code.
I've reduced it to this trivial loop:
let ofd = fs.openSync(fn, 'w')
let ws = fs.createWriteStream('', { fd: ofd })
:
while { /*..write ~4GB of binary formatted 32bit floats and uint32s...*/ }
:
:
ws.end()
The stream.write function returns a boolean value which indicates whether the internal buffer is full. The buffer size is controlled by the highWaterMark option. However, this option is a threshold rather than a hard limit, which means you can still call stream.write even when the internal buffer is full, and memory usage will keep growing if you write code like this:
while (foo) {
    ws.write(bar);
}
In order to solve this issue, you have to handle the false return value from ws.write and wait for the stream's drain event, as in the following example:
async function write() {
    while (foo) {
        if (!ws.write(bar)) {
            await new Promise(resolve => ws.once('drain', resolve));
        }
    }
}
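Applied to the loop in the question, a back-pressure-aware version might look roughly like this (a sketch; the float/uint32 values written here are placeholders for whatever the real loop produces):
const fs = require('fs');

async function writeAll(fn, recordCount) {
    const ws = fs.createWriteStream(fn);
    for (let i = 0; i < recordCount; i++) {
        const buf = Buffer.alloc(8);
        buf.writeFloatLE(Math.random(), 0); // placeholder 32-bit float
        buf.writeUInt32LE(i, 4);            // placeholder uint32
        if (!ws.write(buf)) {
            // Internal buffer has hit highWaterMark: wait for it to drain.
            await new Promise(resolve => ws.once('drain', resolve));
        }
    }
    // Wait for the stream to flush and close.
    await new Promise(resolve => ws.end(resolve));
}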

WriteStream nodejs out memory

I'm trying to create a 20 MB file, but it throws an out-of-memory error. I set max-old-space-size to 2 GB, but it still fails. Can someone explain to me why writing a 20 MB stream consumes so much memory?
I have 2.3 GB of free memory.
let size = 20 * 1024 * 1024; // 20MB
for (let i = 0; i < size; i++) {
    writeStream.write('A');
}
writeStream.end();
As mentioned in the Node documentation, a Writable stores data in an internal buffer. The amount of data that can be buffered depends on the highWaterMark option passed into the stream's constructor.
As long as the size of the buffered data is below highWaterMark, calls to Writable.write(chunk) will return true. Once the buffered data exceeds the limit specified by highWaterMark, it returns false. This is when you should stop writing more data to the Writable and wait for the drain event, which indicates that it's now appropriate to resume writing data.
Your program crashes because it keeps writing even when the internal buffer has exceeded highWaterMark.
Check the docs about Event:'drain'. It includes an example program.
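A corrected version of the loop in the question, following that pattern, might look like this (a sketch along the lines of the example in the docs):
const fs = require('fs');

async function writeTwentyMegabytes(path) {
    const writeStream = fs.createWriteStream(path);
    const size = 20 * 1024 * 1024; // 20MB
    for (let i = 0; i < size; i++) {
        if (!writeStream.write('A')) {
            // The internal buffer is over highWaterMark: wait for 'drain' before continuing.
            await new Promise(resolve => writeStream.once('drain', resolve));
        }
    }
    writeStream.end();
}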
This looks like a nice use case for Readable.pipe(Writable)
You can create a generator function that returns a character and then create a Readable from that generator by using Readable.from(). Then pipe the output of Readable to a Writable file.
The reason why it's beneficial to use pipe here is that :
A key goal of the stream API, particularly the stream.pipe() method, is to limit the buffering of data to acceptable levels such that sources and destinations of differing speeds will not overwhelm the available memory. (link)
and
The flow of data will be automatically managed so that the destination Writable stream is not overwhelmed by a faster Readable stream. (link)
const { Readable } = require('stream');
const fs = require('fs');

const size = 20 * 1024 * 1024; // 20MB

function* generator(numberOfChars) {
    while (numberOfChars--) {
        yield 'A';
    }
}

const writeStream = fs.createWriteStream('./output.txt');
const readable = Readable.from(generator(size));
readable.pipe(writeStream);
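If you also want errors propagated and both streams cleaned up automatically, stream.pipeline (here in its promise form, available in newer Node versions) is a small step up from .pipe(); a sketch:
const { Readable } = require('stream');
const { pipeline } = require('stream/promises');
const fs = require('fs');

async function writeWithPipeline(path, numberOfChars) {
    function* generator(n) {
        while (n--) {
            yield 'A';
        }
    }
    // pipeline() applies the same back-pressure as pipe(), and additionally
    // destroys both streams and rejects if either side errors.
    await pipeline(Readable.from(generator(numberOfChars)), fs.createWriteStream(path));
}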

How to read a file by setting correct offset and position and write to the response in Nodejs with manual buffering?

I want to read a file in 64-byte intervals, and I do not want to use any functionality that internally implements buffering; I want to do the buffering manually. So I started using fs.read(). I tried hard, but I really don't know how to set position (which tells where to read from in the file) and offset (where in the buffer to start writing). I found a few resources and started implementing it on my own, but what I did seems entirely wrong. Please find my code below.
app.get('/manualBufferAnother', function (req, res, next) {
    var filePath = path.join(__dirname, 'Koala.jpg');
    console.log("FilePath is: " + filePath);
    var fileName = path.basename(filePath);
    var mimeType = mime.lookup(filePath);
    var stat = fs.statSync(filePath);
    res.writeHead(200, {
        "Content-Type": mimeType,
        "Content-Disposition": "attachment; filename=" + fileName,
        'connection': 'keep-alive',
        "Content-Length": stat.size,
        "Transfer-Encoding": "chunked"
    });
    fs.open(filePath, 'r', function (err, fd) {
        var completeBufferSize = stat.size;
        var offset = 0;   // the offset in the buffer to start writing at
        var length = 511; // an integer specifying the number of bytes to read
        var position = 0; // an integer specifying where to begin reading from in the file;
                          // if position is null, data will be read from the current file position
        var buffer = new Buffer(completeBufferSize);
        buf(res, fd, offset, position, length, buffer, stat);
    });
});
var buf = function (res, fd, offset, position, length, buffer, stat) {
    if (position + buffer.length < length) {
        fs.read(fd, buffer, offset, length, position, function (error, bytesRead, bufferr) {
            res.write(bufferr.slice(0, bytesRead));
            console.log("Bytes Read: " + bytesRead);
            position = position + bufferr.length;
            buf(res, fd, offset, position, length, bufferr, stat);
        });
    } else {
        fs.read(fd, buffer, offset, length, position, function (error, bytesRead, bufferr) {
            console.log("Bytes Read in else: " + bytesRead);
            res.end(bufferr.slice(0, bytesRead));
            fs.close(fd);
        });
    }
};
I know this code is doing a lot of things wrong, but I don't know the right way.
Should I use any loop for setting and storing position and offset values?
It would be really helpful if you could provide a good reference.
Here is an example:
res.writeHead(...);

var SIZE = 64; // 64 byte intervals

fs.open(filepath, 'r', function (err, fd) {
    fs.fstat(fd, function (err, stats) {
        var bufferSize = stats.size;
        var buffer = new Buffer(bufferSize);
        var bytesRead = 0;
        while (bytesRead < bufferSize) {
            var size = Math.min(SIZE, bufferSize - bytesRead);
            var read = fs.readSync(fd, buffer, bytesRead, size, bytesRead);
            bytesRead += read;
        }
        res.write(buffer);
    });
});
Should I use any loop for setting and storing position and offset values?
Yes, you can, but be careful. In Node.js, most file system functions are asynchronous (non-blocking). As you probably realised, putting an asynchronous function in a loop is going to cause problems. You can tell whether a function is asynchronous by looking at the Node.js documentation and checking if there is a callback parameter. So using read inside a loop is bad. We can instead use readSync, which is the synchronous (blocking) version and is similar to C's read() function (which is also blocking).
I really don't know how to set position which tells where to read from the file and offset in the buffer to start writing at.
The arguments of the readSync function control both where to read from in the file and where in the destination buffer to write to.
//                        /----- where to start writing at in `buffer`
fs.readSync(fd, buffer, offset, length, position)
//                                        \------- where to read from in the
//                                                 file given by `fd`
Note: The above style is idiomatic for C, but in Javascript it is considered poor practice -- the code will not scale well. In general you don't want to use synchronous functions ever because they block the single thread of execution that Javascript uses (aka "blocking the event loop").
From Express.js:
Synchronous functions and methods tie up the executing process until they return. A single call to a synchronous function might return in a few microseconds or milliseconds, however in high-traffic websites, these calls add up and reduce the performance of the app. Avoid their use in production. Although Node and many modules provide synchronous and asynchronous versions of their functions, always use the asynchronous version in production.
Using .pipe() and Streams (asynchronous style) is generally the way to go if you want the most idiomatic and performant code. Sorry to say this: no official sources / popular websites will describe file operations using synchronous functions via C-style blocking and buffering because it is bad practice in Node.js.
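For comparison, a rough sketch of the asynchronous style (reusing app, path and the Koala.jpg file from the question) that still produces 64-byte chunks, with pipe() handling back-pressure:
const fs = require('fs');
const path = require('path');

app.get('/streamedBuffer', function (req, res) {
    const filePath = path.join(__dirname, 'Koala.jpg');
    // highWaterMark: 64 makes the read stream emit 64-byte chunks;
    // pipe() pauses reading whenever the response can't keep up.
    const readStream = fs.createReadStream(filePath, { highWaterMark: 64 });
    readStream.on('error', function (err) { res.destroy(err); });
    readStream.pipe(res);
});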

NodeJS: How to write a file parser using readStream?

I have a file in a binary format:
The format is as follows:
[4 header bytes] [8 bytes - int64 - how many bytes to read following] [variable number of bytes (the value of the int64) - the actual information]
And then it repeats, so I must first read the first 12 bytes to determine how many more bytes I need to read.
I have tried:
var readStream = fs.createReadStream('/path/to/file.bin');
readStream.on('data', function(chunk) { ... })
The problem I have is that chunk always comes back in chunks of 65536 bytes at a time whereas I need to be more specific on the number of bytes that I am reading.
I have also tried readStream.on('readable', function() { readStream.read(4) })
But it is also not very flexible, because it seems to turn asynchronous code into synchronous code, since I have to put the 'reading' in a while loop.
Or maybe readStream is not appropriate in this case and I should use this instead? fs.read(fd, buffer, offset, length, position, callback)
Here's what I'd recommend as an abstract handler of a readStream to process abstract data like you're describing:
var pending = new Buffer(9999999);
var cursor = 0;

stream.on('data', function (d) {
    d.copy(pending, cursor);
    cursor += d.length;
    var test = attemptToParse(pending.slice(0, cursor));
    while (test !== false) {
        // test is a valid blob of data
        processTheThing(test);
        var rawSize = test.raw.length; // How many bytes of data did the blob actually take up?
        pending.copy(pending, 0, rawSize, cursor); // Copy the data after the valid blob to the beginning of the pending buffer
        cursor -= rawSize;
        test = attemptToParse(pending.slice(0, cursor)); // Is there more than one valid blob of data in this chunk? Keep processing if so
    }
});
For your use-case, ensure the initialized size of the pending Buffer is large enough to hold the largest possible valid blob of data you'll be parsing (you mention an int64; that max size plus the header size) plus one extra 65536 bytes in case the blob boundary happens just on the edge of a stream chunk.
My method requires an attemptToParse() function that takes a buffer and tries to parse the data out of it. It should return false if the buffer is too short (not enough data has come in yet). If it finds a valid object, it should return some parsed object that has a way to show the raw bytes it took up (the .raw property in my example). Then you do any processing you need to do with the blob (processTheThing()), trim out that valid blob of data, shift the pending Buffer to just be the remainder, and keep going. That way, you don't have a constantly growing pending buffer, or some array of "finished" blobs. Maybe the process on the receiving end of processTheThing() keeps an array of the blobs in memory, maybe it writes them to a database, but in this example that's abstracted away, so this code just deals with how to handle the stream data.
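For the specific format described in the question (4 header bytes, an 8-byte int64 length, then that many payload bytes), attemptToParse() could look something like this sketch (it assumes a big-endian length field that fits in a regular JS number; adjust to readBigUInt64LE if the format is little-endian):
function attemptToParse(buf) {
    if (buf.length < 12) {
        return false; // not enough data yet for the 4-byte header plus 8-byte length
    }
    // Assumed big-endian; use readBigUInt64LE for a little-endian format.
    const payloadLength = Number(buf.readBigUInt64BE(4));
    if (buf.length < 12 + payloadLength) {
        return false; // the payload hasn't fully arrived yet
    }
    const raw = buf.slice(0, 12 + payloadLength);
    return {
        header: raw.slice(0, 4),
        payload: raw.slice(12),
        raw: raw // lets the caller see how many bytes this blob consumed
    };
}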
Add the chunk to a Buffer, and then parse the data from there, being careful not to go beyond the end of the buffer (if your data is large). I'm using my tablet right now, so I can't add any example source code. Maybe somebody else can?
Ok, mini source, very skeletal.
var chunks = [];
var bytesRead = 0;

stream.on('data', function (chunk) {
    chunks.push(chunk);
    bytesRead += chunk.length;
    // look at bytesRead...
    var buffer = Buffer.concat(chunks);
    chunks = [buffer]; // trick for next event
    // --> or, if memory is an issue, remove completed data from the beginning of chunks
    // work with the buffer here...
});

How do I do random access reads from (large) files using node.js?

Am I missing something or does node.js's standard file I/O module lack analogs of the usual file random access methods?
seek() / fseek()
tell() / ftell()
How does one read random fixed-size records from large files in node without these?
tell() is not, but it is pretty rare not to already know the position you are at in a file, or not to have a way to keep track of it yourself.
seek() is exposed indirectly via the position argument of fs.read and fs.write. When given, that argument seeks to that location before performing the operation; if it is null, the previous file position is used.
Node doesn't have these built in; the closest you can get is to use fs.createReadStream with a start option to start reading from an offset (pass in an existing fd to avoid re-opening the file).
http://nodejs.org/api/fs.html#fs_fs_createreadstream_path_options
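For example, reading one fixed-size record this way might look like the following sketch (recordSize and recordIndex are hypothetical):
const fs = require('fs');

function readRecordStream(path, recordSize, recordIndex, callback) {
    const start = recordIndex * recordSize;
    // 'end' is inclusive, so this reads exactly recordSize bytes.
    const stream = fs.createReadStream(path, { start: start, end: start + recordSize - 1 });
    const chunks = [];
    stream.on('data', function (chunk) { chunks.push(chunk); });
    stream.on('error', callback);
    stream.on('end', function () { callback(null, Buffer.concat(chunks)); });
}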
I suppose that createReadStream creates a new file descriptor over and over. I prefer the sync version:
function FileBuffer(path) {
    const fd = fs.openSync(path, 'r');

    function slice(start, end) {
        const chunkSize = end - start;
        const buffer = new Buffer(chunkSize);
        fs.readSync(fd, buffer, 0, chunkSize, start);
        return buffer;
    }

    function close() {
        fs.closeSync(fd);
    }

    return {
        slice,
        close
    };
}
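Usage might look like this (the file name and offsets are made up):
const file = FileBuffer('/path/to/large.bin');
const first = file.slice(0, 64);     // the first 64-byte record
const later = file.slice(576, 640);  // the record starting at byte 576
file.close();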
Use this:
fs.open(path, flags[, mode], callback)
Then this:
fs.read(fd, buffer, offset, length, position, callback)
Read this for details:
https://nodejs.org/api/fs.html#fs_fs_read_fd_buffer_offset_length_position_callback
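Put together, reading one fixed-size record at a given offset might look like this sketch (recordSize and recordIndex are placeholders):
const fs = require('fs');

function readRecord(path, recordSize, recordIndex, callback) {
    fs.open(path, 'r', function (err, fd) {
        if (err) return callback(err);
        const buffer = Buffer.alloc(recordSize);
        const position = recordIndex * recordSize; // where to start reading in the file
        fs.read(fd, buffer, 0, recordSize, position, function (err, bytesRead) {
            fs.close(fd, function () {});
            if (err) return callback(err);
            callback(null, buffer.slice(0, bytesRead));
        });
    });
}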
