We have a huge text file which we want to manipulate as a stream, line by line.
Is there a way to use the Node.js readline module in a transform stream? For instance, to convert the whole text to uppercase while processing it line by line?
event-stream might be a better fit. It can split the input on lines and transform those lines in various ways (+ more).
For instance, to uppercase everything read from stdin:
const es = require('event-stream');
process.stdin
.pipe(es.split()) // split lines
.pipe(es.mapSync(data => data.toUpperCase())) // uppercase the line
.pipe(es.join('\n')) // add a newline again
.pipe(process.stdout); // write to stdout
Related
I have a huge CSV (1.5 GB) which I need to process line by line to construct two XML files. When I run just the processing, my program takes about 4 minutes to execute; if I also generate the XML files, it takes over 2.5 hours to produce two 9 GB XML files.
My code for writing the XML files is really simple: I use fs.appendFileSync to write the opening/closing XML tags and the text inside them. To sanitize the data I run this function on the text inside the XML tags.
function() {
  return this.replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
};
Is there something I could optimize to reduce the execution time?
fs.appendFileSync() is a relatively expensive operation: it opens the file, appends the data, then closes it again.
It'll be faster to use a writable stream:
const fs = require('node:fs');
// create the stream
const stream = fs.createWriteStream('output.xml');
// then for each chunk of XML
stream.write(yourXML);
// when done, end the stream to close the file
stream.end();
I drastically reduced the execution time (to 30 minutes) by doing two things:
Setting the environment variable UV_THREADPOOL_SIZE=64
Buffering my writes to the XML file (I flush the buffer to the file after 20,000 closed tags); a sketch of this follows below
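A minimal sketch of what that buffering might look like (the writeXml/flush helpers and the 20,000-tag threshold are illustrative, not the asker's actual code):

const fs = require('node:fs');

const out = fs.createWriteStream('output.xml');
let buffer = '';
let closedTags = 0;

// Accumulate XML fragments in memory and only write to the stream
// once 20,000 closing tags have been buffered.
function writeXml(fragment) {
  buffer += fragment;
  const closers = fragment.match(/<\//g);
  if (closers) closedTags += closers.length;
  if (closedTags >= 20000) flush();
}

function flush() {
  out.write(buffer);
  buffer = '';
  closedTags = 0;
}

// When processing is finished: flush(); out.end();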
How can I convert the numeric file descriptor in process.stdin to a FileHandle object like those returned by fs.promises.open()?
Rationale:
I want to work with stdin or a named input file in a uniform way
I'd hate for that uniform way to be based on numeric file descriptors (which could be done by using filehandle.fd, but eughh)
There does not seem to be a stable way to get a FileHandle from a fd value, at least as of 19.2.0. There is a complicated work-around here that might work, but it is clearly not a recommended approach: https://github.com/nodejs/node/issues/43821
If you're okay not supporting Windows, you could do:
import fs from "node:fs/promises"
const inputFileHandle = await fs.open("/dev/stdin", "r")
const outputFileHandle = await fs.open("/dev/stdout", "w")
It doesn't actually use the same underlying file descriptor as process.stdin.fd and process.stdout.fd (0 and 1, respectively), but it should achieve basically the same effect.
https://nodejs.org/api/readline.html
provides this solution for reading large files like CSVs line by line:
const { createReadStream } = require('fs');
const { createInterface } = require('readline');
const { once } = require('events'); // needed for `await once(rl, 'close')` below
(async function processLineByLine() {
try {
const rl = createInterface({
input: createReadStream('big-file.txt'),
crlfDelay: Infinity
});
rl.on('line', (line) => {
// Process the line.
});
await once(rl, 'close');
console.log('File processed.');
} catch (err) {
console.error(err);
}
})();
But I don't want to read the entire file from beginning to end, only parts of it, say from line number 1 to 10000, 20000 to 30000, etc.
Basically I want to be able to set a 'start' & 'end' line for a given run of my function.
Is this doable with readline & fs.createReadStream?
If not, please suggest an alternative approach.
PS: It's a large file (around 1 GB) & loading it in memory causes memory issues.
But I don't want to read the entire file from beginning to end but parts of it say from line number 1 to 10000, 20000 to 30000, etc.
Unless your lines are of fixed, identical length, there is NO way to know where line 10,000 starts without reading from the beginning of the file and counting lines until you get there. That's how text files with variable-length lines work. Lines are not physical structures that the file system knows anything about; to the file system, the file is just a gigantic blob of data. The concept of lines is something we invent at a higher level, so the file system or OS knows nothing about them. The only way to know where lines are is to read the data and "parse" it into lines by searching for line delimiters. So, line 10,000 is found only by counting line delimiters from the beginning of the file until you reach the 10,000th one.
There is no way around it, unless you preprocess the file into a more efficient format (like a database) or create an index of line positions.
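If scanning from the top is acceptable, a rough sketch of a line-range reader with readline (the processLineRange name and its parameters are made up for illustration):

const { createReadStream } = require('fs');
const { createInterface } = require('readline');

// Stream the file, skip lines before startLine, stop after endLine.
// Lines before startLine are still read from disk; with variable-length
// lines there is no way to avoid that.
async function processLineRange(path, startLine, endLine, processLine) {
  const rl = createInterface({ input: createReadStream(path), crlfDelay: Infinity });
  let lineNo = 0;
  for await (const line of rl) {
    lineNo++;
    if (lineNo < startLine) continue;
    if (lineNo > endLine) break; // stops reading the rest of the file
    processLine(line, lineNo);
  }
}

// e.g. await processLineRange('big-file.txt', 20000, 30000, (line) => { /* ... */ });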
Basically I want to be able to set a 'start' & 'end' line for a given run of my function.
The only way to do that is to "index" the data ahead of time so you already know where each line starts/ends. Some text editors made to handle very large files do this. They read through the file (perhaps lazily) reading every line and build an in-memory index of what file offset each line starts at. Then, they can retrieve specific blocks of lines by consulting the index and reading that set of data from the file.
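As a rough illustration of that indexing idea (the buildLineIndex helper is made up and assumes '\n' line endings), you could record the byte offset where each line starts, then later read a specific range with fs.createReadStream's start/end options:

const fs = require('fs');
const { createInterface } = require('readline');

// Single pass over the file: offsets[n] is the byte position where line n (0-based) begins.
async function buildLineIndex(path) {
  const offsets = [0];
  let pos = 0;
  const rl = createInterface({ input: fs.createReadStream(path), crlfDelay: Infinity });
  for await (const line of rl) {
    pos += Buffer.byteLength(line) + 1; // +1 for '\n'; use +2 for '\r\n' files
    offsets.push(pos);
  }
  return offsets;
}

// Later, read lines 10000..19999 (0-based) without rescanning:
// const index = await buildLineIndex('big-file.txt');
// fs.createReadStream('big-file.txt', { start: index[10000], end: index[20000] - 1 });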
Is this doable with readline & fs.createReadStream?
Without fixed length lines, there's no way to know where in the file line 10,000 starts without counting from the beginning.
It's a large file (around 1 GB) & loading it in memory causes memory issues.
Streaming the file a line at a time with the linereader module or others that do something similar will handle the memory issue just fine so that only a block of data from the file is in memory at any given time. You can handle arbitrarily large files even in a small memory system this way.
A new line is just a character (or two characters if you're on Windows); you have no way of knowing where those characters are without processing the file.
You are however able to read only a certain byte range in a file. If you know for a fact that every line contains 64 bytes, you can skip the first 100 lines by starting your read at byte 6400, and you can read only 100 lines by stopping your read at byte 12800.
Details on how to specify start and end points are available in the createReadStream docs.
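For example, with a hypothetical file of fixed 64-byte lines, skipping the first 100 lines and reading the next 100 comes down to byte offsets (note that end is inclusive in createReadStream):

const fs = require('fs');

const LINE_SIZE = 64;   // only valid if every line is exactly 64 bytes
const skipLines = 100;
const readLines = 100;

const stream = fs.createReadStream('fixed-width.txt', {
  start: skipLines * LINE_SIZE,                  // byte 6400
  end: (skipLines + readLines) * LINE_SIZE - 1,  // byte 12799 (end is inclusive)
});

stream.on('data', (chunk) => {
  // chunk only ever contains bytes 6400..12799 of the file
});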
I am trying to read (via streaming) a large file in a Lambda function. My goal is to just read the first few lines and look for some information. The input file in S3 seems to have hex characters (NUL) and the following code stops reading the line when it hits the NUL character and goes to the next line. I would like to know how can I read the whole line and replace/remove the NUL character before I look for the information in the line. Here is the code that does not work as expected:
var readline = require('line-reader');
var readStream = s3.getObject({Bucket: S3Bucket, Key: fileName}).createReadStream();
readline.eachLine(readStream, {separator: '\n', encoding: 'utf8'}, function(line) {
console.log('Line ',line);
});
As mentioned by Brad, it's hard to help as this is more an issue with your line-reader lib.
I can offer an alternative solution, however, that doesn't require the lib.
I would use GetObject as you are, but I would also specify a value for the range parameter, then work my way through the file in chunks and stop reading chunks once I am satisfied.
If the chunk I read doesn't have a \n, read another chunk; keep going until I get a \n, then read from the start of my buffered data to the \n, set the new starting position based on the position of the \n, and then read a new chunk from that position if you want to read more data.
Check out the range parameter in the api:
http://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/S3.html#getObject-property
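A rough sketch of that chunked approach using the AWS SDK v2's Range option (chunk size, variable names, and the NUL-stripping step are illustrative, not a drop-in implementation):

const AWS = require('aws-sdk');
const s3 = new AWS.S3();

const CHUNK_SIZE = 64 * 1024; // bytes fetched per request

// Fetch byte ranges from the object until the buffered data contains '\n',
// then return everything up to that first newline, with NUL characters removed.
async function readFirstLine(Bucket, Key) {
  let buffered = '';
  let start = 0;
  while (true) {
    const { Body } = await s3.getObject({
      Bucket,
      Key,
      Range: `bytes=${start}-${start + CHUNK_SIZE - 1}`,
    }).promise();
    buffered += Body.toString('utf8').replace(/\u0000/g, ''); // strip NUL bytes
    const nl = buffered.indexOf('\n');
    if (nl !== -1) return buffered.slice(0, nl);
    start += CHUNK_SIZE;
    // In real code, stop once start reaches the object's ContentLength
    // to avoid an InvalidRange error at the end of the object.
  }
}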
Considering this node.js app below:
var spawn = require('child_process').spawn,
dir = spawn('dir', ['*.txt', '/b', '/s']);
dir.stdout.on('data', function (data) {
//(A)
console.log('stdout: ' + data);
});
In (A), the data event handler waits for stdout output, and we might expect the output to arrive 'line by line' from cmd /c dir *.txt /b /s.
But that's not what happens. The data variable contains more than one line of stdout output, and to do something with each file path we have to split it on CRLF (\r\n). Why does this happen?
Because this is just a pure data stream from the child process's standard output. There is no knowledge of whether that data is in any particular format, or whether it will contain any specific characters at all. So the data is treated like a stream of bytes and handled in chunks with no regard for the content or meaning of those bytes. That's the most general form of piping data around the system.
Note, however, that there are wrapper streams that will buffer the raw data stream and give you a series of lines of text. You will find many modules for this on npmjs.org.
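For instance, the built-in readline module can act as such a wrapper, turning the raw stdout chunks into 'line' events. A sketch, using cmd /c as mentioned in the question:

const { spawn } = require('child_process');
const readline = require('readline');

const dir = spawn('cmd', ['/c', 'dir', '*.txt', '/b', '/s']);

// readline buffers the raw chunks and emits one event per complete line.
const rl = readline.createInterface({ input: dir.stdout, crlfDelay: Infinity });

rl.on('line', (filePath) => {
  console.log('file:', filePath);
});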