Node.js "readline" + "fs. createReadStream" : Specify start & end line number - node.js

The Node.js readline documentation (https://nodejs.org/api/readline.html)
provides this solution for reading large files like CSVs line by line:
const { createReadStream } = require('fs');
const { createInterface } = require('readline');
const { once } = require('events');

(async function processLineByLine() {
  try {
    const rl = createInterface({
      input: createReadStream('big-file.txt'),
      crlfDelay: Infinity
    });

    rl.on('line', (line) => {
      // Process the line.
    });

    await once(rl, 'close');
    console.log('File processed.');
  } catch (err) {
    console.error(err);
  }
})();
But I don't want to read the entire file from beginning to end, only parts of it, say from line number 1 to 10000, then 20000 to 30000, etc.
Basically I want to be able to set a 'start' & 'end' line for a given run of my function.
Is this doable with readline & fs.createReadStream?
If not, please suggest an alternate approach.
PS: It's a large file (around 1 GB) & loading it in memory causes memory issues.

But I don't want to read the entire file from beginning to end, only parts of it, say from line number 1 to 10000, then 20000 to 30000, etc.
Unless your lines are of fixed, identical length, there is NO way to know where line 10,000 starts without reading from the beginning of the file and counting lines until you get to line 10,000. That's how text files with variable length lines work. Lines in the file are not physical structures that the file system knows anything about. To the file system, the file is just a gigantic blob of data. The concept of lines is something we invent at a higher level and thus the file system or OS knows nothing about lines. The only way to know where lines are is to read the data and "parse" it into lines by searching for line delimiters. So, line 10,000 is only found by searching for the 10,000th line delimiter starting from the beginning of the file and counting.
There is no way around it, unless you preprocess the file into a more efficient format (like a database) or create an index of line positions.
Basically I want to be able to set a 'start' & 'end' line for a given run of my function.
The only way to do that is to "index" the data ahead of time so you already know where each line starts/ends. Some text editors made to handle very large files do this. They read through the file (perhaps lazily) reading every line and build an in-memory index of what file offset each line starts at. Then, they can retrieve specific blocks of lines by consulting the index and reading that set of data from the file.
Is this doable with readline & fs.createReadStream?
Without fixed length lines, there's no way to know where in the file line 10,000 starts without counting from the beginning.
It's a large file (around 1 GB) & loading it in memory causes memory issues.
Streaming the file a line at a time with the line-reader module or others that do something similar will handle the memory issue just fine, since only a block of data from the file is in memory at any given time. You can handle arbitrarily large files even on a small-memory system this way.
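For example, here is a minimal sketch of that streaming approach (processLineRange is just an illustrative name): it still has to count lines from the start of the file, but it only keeps one chunk in memory and stops reading as soon as the requested range has been passed:

const { createReadStream } = require('fs');
const { createInterface } = require('readline');

// Stream the file and only hand lines in [startLine, endLine] (1-based,
// inclusive) to the callback; everything before the range is skipped, and
// the stream is torn down as soon as the range has been passed.
function processLineRange(filePath, startLine, endLine, onLine) {
  return new Promise((resolve, reject) => {
    const input = createReadStream(filePath);
    const rl = createInterface({ input, crlfDelay: Infinity });
    let lineNumber = 0;

    rl.on('line', (line) => {
      lineNumber++;
      if (lineNumber < startLine) return;   // still counting up to the range
      if (lineNumber > endLine) {           // past the range: stop reading
        rl.close();
        input.destroy();
        return;
      }
      onLine(line, lineNumber);
    });

    rl.on('close', resolve);
    input.on('error', reject);
  });
}

// Example: lines 20000..30000 only
// await processLineRange('big-file.txt', 20000, 30000, (line) => console.log(line));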

A newline is just a character (or two characters if you're on Windows); you have no way of knowing where those characters are without processing the file.
You are however able to read only a certain byte range in a file. If you know for a fact that every line contains 64 bytes, you can skip the first 100 lines by starting your read at byte 6400, and you can read only 100 lines by stopping your read at byte 12800.
Details on how to specify start and end points are available in the createReadStream docs.
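For instance, a sketch of that fixed-length case, assuming 64-byte lines (including the newline) and a file called 'big-file.txt'; note that createReadStream's end option is inclusive, so the last byte read here is 12799:

const { createReadStream } = require('fs');

// Fixed-length records: 64 bytes per line, purely an assumption for the example.
const LINE_LENGTH = 64;
const startLine = 100;   // skip the first 100 lines
const lineCount = 100;   // read the next 100 lines

const stream = createReadStream('big-file.txt', {
  start: startLine * LINE_LENGTH,                   // byte 6400
  end: (startLine + lineCount) * LINE_LENGTH - 1,   // byte 12799 ('end' is inclusive)
  encoding: 'utf8'
});

let data = '';
stream.on('data', (chunk) => { data += chunk; });
stream.on('end', () => {
  const lines = data.split('\n').filter(Boolean);
  console.log(`Read ${lines.length} lines`);
});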

Related

Nodejs how to optimize writing very large xml files?

I have a huge CSV (1.5 GB) which I need to process line by line and construct two XML files. When I run the processing alone my program takes about 4 minutes to execute; if I also generate my XML files it takes over 2.5 hours to generate two 9 GB XML files.
My code for writing the XML files is really simple: I use fs.appendFileSync to write my opening/closing XML tags and the text inside them. To sanitize the data I run this function on the text inside the XML tags.
function () {
  return this.replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
};
Is there something I could optimize to reduce the execution time?
fs.appendFileSync() is a relatively expensive operation: it opens the file, appends the data, then closes it again.
It'll be faster to use a writable stream:
const fs = require('node:fs');
// create the stream
const stream = fs.createWriteStream('output.xml');
// then for each chunk of XML
stream.write(yourXML);
// when done, end the stream to close the file
stream.end();
I drastically reduced the execution time (to 30 minutes) by doing 2 things.
Setting the ENV variable UV_THREADPOOL_SIZE=64
Buffering my writes to the xml file (I flush the buffer to the file after 20,000 closed tags)
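A rough sketch of that buffering idea on top of a write stream (the 20,000-tag threshold is from the answer above; the writeTag and finish helper names are just illustrative):

const fs = require('fs');

const out = fs.createWriteStream('output.xml');
const FLUSH_EVERY = 20000;   // flush after this many closed tags

let buffer = '';
let closedTags = 0;

function writeTag(xmlFragment) {
  buffer += xmlFragment;
  closedTags++;
  if (closedTags >= FLUSH_EVERY) {
    out.write(buffer);   // one large write instead of thousands of small ones
    buffer = '';
    closedTags = 0;
  }
}

function finish() {
  if (buffer) out.write(buffer);
  out.end();
}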

Why is the same function in python-chess returning different results?

I'm new to working with python-chess and I was perusing the official documentation. I noticed this very weird thing I just can't make sense of. This is from the documentation:
import chess.pgn
pgn = open("data/pgn/kasparov-deep-blue-1997.pgn")
first_game = chess.pgn.read_game(pgn)
second_game = chess.pgn.read_game(pgn)
So as you can see, the exact same function call chess.pgn.read_game(pgn) results in two different games showing up. I tried with my own pgn file and sure enough first_game == second_game resulted in False. I also tried third_game = chess.pgn.read_game(pgn) and sure enough that gave me the (presumably) third game from the pgn file. How is this possible? If I'm using the same function shouldn't it return the same result every time for the same file? Why should the variable name matter (I'm assuming it does), unless programming languages changed overnight or there's a random function built in somewhere?
The only way that this can be possible is if some data is changing. This could be data that chess.pgn.read_game reads from elsewhere, or could be something to do with the object you're passing in.
In Python, file-like objects store where they are in the file. If they didn't, then this code:
with open("/home/wizzwizz4/Documents/TOPSECRET/diary.txt") as f:
line = f.readline()
while line:
print(line, end="")
line = f.readline()
would just print the first line over and over again. When data's read from a file, Python won't give you that data again unless you specifically ask for it.
There are multiple games in this file, stored one after the other. You're passing in the same file each time, but you're not resetting the read cursor to the beginning of the file (f.seek(0)) or closing and reopening the file, so it's going to read the next data available – i.e., the next game.

Read First line from a file stored in Azure Cloud Blob Storage

I'm trying to read the first line of a file that is stored in an Azure Storage Blob container. The code snippet below is the standard way to read a file to the end and write out its content:
foreach (IListBlobItem item in container.ListBlobs(null, false))
{
    if (item.GetType() == typeof(CloudBlockBlob))
    {
        CloudBlockBlob blob = (CloudBlockBlob)item;
        using (var stream = blob.OpenRead())
        {
            using (StreamReader reader = new StreamReader(stream))
            {
                while (!reader.EndOfStream)
                {
                    Console.WriteLine(reader.ReadLine().First());
                    //Console.WriteLine(reader.ReadLine());
                }
            }
        }
    }
}
I want the first line of the file. But I cannot use "while (!reader.EndOfStream)" as it reads the whole file and then writes it to the console line by line.
Also, I cannot load the whole file, as the file size is more than 3 GB.
How do I get hold of only the first line from the file stored in Azure Blob Storage?
Azure Blob Storage supports reading byte ranges. So you don't really need to download the entire blob to read just the first line in the file. The method you would want to use is CloudBlob.DownloadRangeToByteArray.
Let's assume that the lines in the blob are separated by a line feed (\n, character code 10). With this assumption, here's what you would need to do:
You could choose to progressively read a single byte at a time in a loop, starting from the 0th byte. You store each byte you read in some kind of byte buffer and continue reading until you encounter the line feed character. As soon as you encounter it, you break out of the loop; whatever you have in the buffer is your first line.
Instead of reading a single byte, you could also read a larger byte range (say 1024 bytes, or maybe larger). Once you get these bytes, you look for the newline character in the downloaded range. If you find it, you slice the array up to the index of that character and that is your first line. If you don't find it, you append the fetched data to a buffer and read the next 1K bytes. You continue doing this until you encounter the newline character; once you find it, you use the buffer plus the last set of bytes received, and that is your first line.
Azure Storage blobs are not the same as local file objects. If you want to do specific parsing of a blob, you need to copy it locally first and then open it as a proper file. And yes, for a 3 GB file, given the 60 MB/s per-blob transfer rate, this could take some time (so you might want to consider storing parts of the blob, such as the first line, in a secondary, searchable storage area for these purposes).
Although a year late and I have not tried it, instead of using
while (!reader.EndOfStream)
{
    Console.WriteLine(reader.ReadLine().First());
}
have you tried this
if (!reader.EndOfStream)
{
    Console.WriteLine(reader.ReadLine());
}

NodeJS - Stream a large ASCII file from S3 with Hex Characters (NUL)

I am trying to read (via streaming) a large file in a Lambda function. My goal is to just read the first few lines and look for some information. The input file in S3 seems to have hex characters (NUL), and the following code stops reading the line when it hits the NUL character and goes to the next line. I would like to know how I can read the whole line and replace/remove the NUL character before I look for the information in the line. Here is the code that does not work as expected:
var readline = require('line-reader');
var readStream = s3.getObject({Bucket: S3Bucket, Key: fileName}).createReadStream();

readline.eachLine(readStream, {separator: '\n', encoding: 'utf8'}, function(line) {
    console.log('Line ', line);
});
As mentioned by Brad, it's hard to help as this is more an issue with your line-reader lib.
I can offer an alternate solution however that doesn't require the lib.
I would use GetObject as you are, but I would also specify a value for the Range parameter, then work my way through the file in chunks and stop reading chunks when I am satisfied.
If the chunk I read doesn't have a \n, I read another chunk and keep going until I get a \n. Then I read from the start of my buffered data to the \n, set the new starting position based on the position of the \n, and read a new chunk from that position if I want more data.
Check out the range parameter in the api:
http://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/S3.html#getObject-property
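For illustration, here is a rough sketch of that chunked Range approach using the v2 AWS SDK (the readFirstLine name, the bucket/key parameters, and the 1024-byte chunk size are all placeholders); it also strips NUL bytes from the recovered line:

const AWS = require('aws-sdk');
const s3 = new AWS.S3();

// Read the object in fixed-size byte ranges until a '\n' shows up,
// then return everything up to it as the first line.
async function readFirstLine(bucket, key, chunkSize = 1024) {
  let buffered = Buffer.alloc(0);
  let start = 0;

  while (true) {
    const end = start + chunkSize - 1;
    const { Body } = await s3.getObject({
      Bucket: bucket,
      Key: key,
      Range: `bytes=${start}-${end}`   // only fetch this slice of the object
    }).promise();

    buffered = Buffer.concat([buffered, Body]);

    const newlineIndex = buffered.indexOf(0x0a);   // '\n'
    if (newlineIndex !== -1) {
      // Strip NUL bytes before handing the line back
      return buffered.slice(0, newlineIndex).toString('utf8').replace(/\0/g, '');
    }

    if (Body.length < chunkSize) {
      // Reached the end of the object without finding a newline
      return buffered.toString('utf8').replace(/\0/g, '');
    }
    start += chunkSize;
  }
}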

Gulp extract text from a file

I want to build up a table of contents file based on the comments from the first line of each file.
I can get to the files no problem and read the contents, but that only returns a buffer of the file.
I want to check if the first line is a comment. If it is, then extract that line and add it to a new file.
var bufferContents = through.obj(function (file, enc, cb) {
    console.log(file.contents);
});
If the file is pretty large, I would recommend using the lazystream module.
https://github.com/jpommerening/node-lazystream
Or, if the line length is fixed, you can set the chunk size accordingly:
https://stackoverflow.com/a/19426486/1502019
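As a rough sketch of the original goal (grab the first line of each file when it is a comment and collect those lines into a table-of-contents file), assuming through2 and buffer-mode Vinyl files; the buildToc name, the toc.txt path and the // comment check are assumptions:

const through = require('through2');
const fs = require('fs');

// Collect the first line of each file when it looks like a comment,
// then write them all to a table-of-contents file at the end.
function buildToc(tocPath) {
  const entries = [];

  return through.obj(
    function (file, enc, cb) {
      if (file.isBuffer()) {
        const firstLine = file.contents.toString('utf8').split(/\r?\n/, 1)[0];
        if (firstLine.trim().startsWith('//')) {     // assumed comment style
          entries.push(`${file.relative}: ${firstLine.trim()}`);
        }
      }
      cb(null, file);                                // pass the file through untouched
    },
    function (cb) {                                  // flush: runs once after all files
      fs.writeFile(tocPath, entries.join('\n'), cb);
    }
  );
}

// Usage in a gulpfile:
// gulp.src('src/**/*.js').pipe(buildToc('toc.txt'));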
