How to load a large file line-by-line in Node.js

What is the best way to read a very large file line-by-line (for example, a text file with 50,000,000 lines that is larger than 8 GB)? There are actually dozens of such files.
Basically, I need to read those files one by one. For each file, after reading a line, the program does something that takes some time to complete (for example, making an HTTP request to send the data to a remote service; in this case, the HTTP request is slower than reading the file).
Thank you.

What? Send 50,000,000 HTTP requests? Are you running a DDoS?
Maybe you should try another way.
Why not upload this file?
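For the reading part of the question itself, here is a minimal sketch using Node's built-in readline module. It assumes the per-line work can be written as an async function; processFileByLine and handleLine are hypothetical names, not something from the original post. The loop waits for each line's work to finish before taking the next line, so the file is consumed incrementally rather than loaded whole.

```javascript
const fs = require('node:fs');
const readline = require('node:readline');

// Hypothetical helper: reads filePath line by line and awaits handleLine
// for each line before moving on. Returns the number of lines processed.
async function processFileByLine(filePath, handleLine) {
  const rl = readline.createInterface({
    input: fs.createReadStream(filePath),
    crlfDelay: Infinity, // treat \r\n as a single line break
  });

  let count = 0;
  for await (const line of rl) {
    // The slow step (e.g. an HTTP request) goes here; the next line is
    // not handed to the loop body until this promise settles.
    await handleLine(line);
    count += 1;
  }
  return count;
}
```

For dozens of files, calling this helper in a plain sequential loop (awaiting each call) would process them one by one. Whether sending one request per line is a good idea is a separate matter, as the comment above points out; batching lines per request may be worth considering.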

Related

Node.js read and write stream to the same file at the same time

TL;DR
I'm browsing through a number of solutions on npm and github looking for something that would allow me to read and write to the same file in two different places at the same time. So far I'm having trouble actually finding anything like this. Is there a module of some sort that will allow that?
Background
In essence my requirement is that in a large file I need to, in the following order:
read
transform
write
Ideally the usage would be something like:
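const fd = fs.openSync(file, "r+"); // fs.open is asynchronous; openSync returns the descriptor directly
const read = createReadStreamSomehowFrom(fd);
const write = createWriteStreamSomehowFrom(fd);
read
  .pipe(new Transform({ transform(chunk, encoding, callback) {...} }))
  .pipe(write);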
I could do that with standard fs.create[Read/Write]Stream but there's no way to control the flow of both streams and if my write position goes beyond read position then I'm reading something I just wrote...
The use case is the same as perl -p -i -e, read and write to the same file (meaning the same inode) asynchronously and replace the contents without loading everything into memory.
I would expect this to be a real-world use case, yet all the implementations I found actually load the whole file into memory and then save it. Am I missing a known module here, or is there a need to actually write something like this?
Hmm... a tough one it seems. :)
So here's for the record - I found no such module and actually discussed this with some people responsible for a nice in-file replacing module. Seeing no way to solve this I decided to write it from scratch and here it is:
signicode/rw-stream repo on github
rw-stream at npm
The module works on a simple principle: no byte can be written until it has been consumed by the readable stream. It's fairly simple underneath (a couple of fs.read/fs.write operations that keep an eye on the current read and write positions).
If you find this useful then I'm happy. :)
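To illustrate the principle (this is not the rw-stream source or its API, just a synchronous sketch under my own assumptions): keep a read position and a write position on the same file descriptor, and only ever write into the region that has already been read. transformInPlace and its parameters are hypothetical names.

```javascript
const fs = require('node:fs');

// Hypothetical helper, NOT the rw-stream API: rewrites a file in place.
// transform receives a Buffer chunk and must return a Buffer.
function transformInPlace(filePath, transform, chunkSize = 64 * 1024) {
  const fd = fs.openSync(filePath, 'r+');
  try {
    const chunk = Buffer.alloc(chunkSize);
    let readPos = 0;               // next byte to read
    let writePos = 0;              // next byte to write; never exceeds readPos
    let pending = Buffer.alloc(0); // transformed bytes waiting to be written
    let bytesRead;

    while ((bytesRead = fs.readSync(fd, chunk, 0, chunkSize, readPos)) > 0) {
      readPos += bytesRead;
      pending = Buffer.concat([pending, transform(chunk.subarray(0, bytesRead))]);

      // Only overwrite bytes that have already been consumed by the reader.
      const writable = Math.min(pending.length, readPos - writePos);
      if (writable > 0) {
        fs.writeSync(fd, pending, 0, writable, writePos);
        writePos += writable;
        pending = pending.subarray(writable);
      }
    }

    // Everything has been read, so the remainder can be flushed safely.
    if (pending.length > 0) {
      fs.writeSync(fd, pending, 0, pending.length, writePos);
      writePos += pending.length;
    }
    fs.ftruncateSync(fd, writePos); // drop leftover bytes if the output shrank
  } finally {
    fs.closeSync(fd);
  }
}
```

Note that if the output grows, the pending buffer grows with it, and the transform sees arbitrary chunk boundaries, so it has to cope with a multi-byte character or a pattern being split across two chunks.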

How should I handle HEAD requests for large files in node.js?

Using my own Node.js server, I want to get the size of a large file (> 4 GB) before making byte-range requests on it. If, upon receiving a HEAD request, I use fs.readFile, I get "RangeError: File size is greater than possible Buffer" errors; if I use fs.createReadStream, I don't get that error, but then I don't know how to respond to the request. For one thing, I don't see how to get the file size from the stream; for another, I don't know how to fill out the response headers even if I knew the file size. Any help would be greatly appreciated. Thanks.
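One possible approach, sketched under the assumption that a single file is served from a known path: fs.stat reports the size without reading any file contents, and a HEAD response consists of headers only. buildHeadHeaders and createFileServer are hypothetical names, and the Content-Type value is a placeholder.

```javascript
const fs = require('node:fs');
const http = require('node:http');

// Hypothetical helper: headers for a HEAD response about a file of `size` bytes.
function buildHeadHeaders(size) {
  return {
    'Content-Length': size,                     // size in bytes, from fs.stat
    'Accept-Ranges': 'bytes',                   // tells the client range requests are supported
    'Content-Type': 'application/octet-stream', // placeholder type
  };
}

// Hypothetical helper: a server that answers HEAD requests for one file.
function createFileServer(filePath) {
  return http.createServer((req, res) => {
    if (req.method !== 'HEAD') {
      res.writeHead(405, { Allow: 'HEAD' });
      res.end();
      return;
    }
    // stat reads only metadata, so the file size does not matter here.
    fs.stat(filePath, (err, stats) => {
      if (err) {
        res.writeHead(404);
        res.end();
        return;
      }
      res.writeHead(200, buildHeadHeaders(stats.size));
      res.end(); // a HEAD response has no body
    });
  });
}
```

Calling createFileServer(somePath).listen(port) would start it. The later byte-range GET requests could be served with fs.createReadStream and its start and end options, which read only the requested slice.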

ZIP file format. How to read file properly?

I'm currently working on a Node.js project. I want to be able to read, modify and write a ZIP file without saving it to the file system (we receive it over TCP and send it back after the modifications have been made), and so far it looks possible because of the simple ZIP file structure. Currently I refer to this documentation.
So ZIP file has simple structure:
File header 1
File data 1
File data descriptor 1
File header 2
File data 2
File data descriptor 2
...
[other not important yet]
First we need to read the file header, which contains the compressed size field, and that looks like the perfect way to read file data 1 by its length. But it actually isn't. This field may contain '0' or '0xFFFFFFFF', and those values don't describe the actual length. In that case we have to read the file data without any information about its length. But how?..
The compression/decompression algorithm descriptions look pretty complex to me, and I plan to use zlib for the compression itself anyway. So if something useful is described there, I missed the point.
Can someone explain the proper way to read those files?
P.S. Please avoid suggesting npm modules. I do not want to only solve the problem, but also to understand how things work.
Note - I'm assuming you want to read and process the zip file as it comes off the socket, rather than reading the complete zip file into memory before processing. Both options are valid.
I'd initially ignore the use cases where the compressed size has a value of '0' or '0xFFFFFFFF'. The former is only present in zip files created in streaming mode, the latter in zip files larger than 4 GB.
Dealing with them adds a lot of complexity - you can add support for them later, if necessary. Whether you ever need to support the 0/0xFFFFFFFF use cases depends on the nature of the zip files you intend to process.
When the compression method is deflated (8), use zlib for compression/decompression. You also need to support compression method stored (0). It gets used for very small files where compression isn't appropriate.
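As a rough sketch of the simple case described above (sizes present in the header, methods 0 and 8 only), here is how one local file entry could be read from a Buffer. The field offsets follow the local file header layout in the ZIP specification; readLocalEntry is a hypothetical name, and decoding the file name as UTF-8 is an assumption (names may use another encoding unless the UTF-8 flag is set).

```javascript
const zlib = require('node:zlib');

const LOCAL_HEADER_SIGNATURE = 0x04034b50; // "PK\x03\x04", little-endian

// Hypothetical helper: parses one local file entry starting at `offset`.
// Streamed entries (sizes in a data descriptor) and ZIP64 are NOT handled.
function readLocalEntry(buf, offset = 0) {
  if (buf.readUInt32LE(offset) !== LOCAL_HEADER_SIGNATURE) {
    throw new Error('not a local file header');
  }
  const flags = buf.readUInt16LE(offset + 6);
  const method = buf.readUInt16LE(offset + 8);
  const compressedSize = buf.readUInt32LE(offset + 18);
  const nameLength = buf.readUInt16LE(offset + 26);
  const extraLength = buf.readUInt16LE(offset + 28);

  // Bit 3 of the flags means the sizes live in a trailing data descriptor.
  if ((flags & 0x0008) !== 0 || compressedSize === 0xffffffff) {
    throw new Error('streamed or ZIP64 entry: not handled in this sketch');
  }

  const nameStart = offset + 30; // the fixed part of the header is 30 bytes
  const dataStart = nameStart + nameLength + extraLength;
  const name = buf.toString('utf8', nameStart, nameStart + nameLength);
  const raw = buf.subarray(dataStart, dataStart + compressedSize);

  let data;
  if (method === 0) {
    data = raw;                       // stored: bytes are kept as-is
  } else if (method === 8) {
    data = zlib.inflateRawSync(raw);  // deflated: raw deflate, no zlib wrapper
  } else {
    throw new Error(`unsupported compression method ${method}`);
  }
  return { name, data, nextOffset: dataStart + compressedSize };
}
```

The returned nextOffset is where the next local file header (or the central directory) would start, so calling the function repeatedly walks the entries in order.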

How to synchronously read from a ReadStream in node

I am trying to read UTF-8 text from a file in a memory and time efficient way. There are two ways to read directly from a file synchronously:
fs.readFileSync will read the entire file and return a buffer containing the file's entire contents
fs.readSync will read a set amount of bytes from a file and return a buffer containing just those contents
I initially just used fs.readFileSync because it's easiest, but I'd like to be able to efficiently handle potentially large files by only reading in chunks of text at a time. So I started using fs.readSync instead. But then I realized that fs.readSync doesn't handle UTF-8 decoding. UTF-8 is simple, so I could whip up some logic to manually decode it, but Node already has services for that, so I'd like to avoid that if possible.
I noticed fs.createReadStream, which returns a ReadStream that can be used for exactly this purpose, but unfortunately it seems to only be available in an asynchronous mode of operation.
Is there a way to read from a ReadStream in a synchronous way? I have a massive stack built on top of this already, and I'd rather not have to refactor it to be asynchronous.
I discovered the string_decoder module, which handles all that UTF-8 decoding logic I was worried I'd have to write. At this point, it seems like a no-brainer to use this on top of fs.readSync to get the synchronous behavior I was looking for.
You basically just keep feeding bytes to it, and whenever it is able to decode complete characters, it returns them. The Node documentation describes how it works well enough.
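A minimal sketch of that combination, assuming a generator is an acceptable way to hand out the text synchronously; readTextChunksSync and the default chunk size are my own hypothetical choices:

```javascript
const fs = require('node:fs');
const { StringDecoder } = require('node:string_decoder');

// Hypothetical helper: synchronously yields decoded UTF-8 text, chunk by chunk.
function* readTextChunksSync(filePath, chunkSize = 64 * 1024) {
  const fd = fs.openSync(filePath, 'r');
  try {
    const decoder = new StringDecoder('utf8');
    const buffer = Buffer.alloc(chunkSize);
    let bytesRead;
    // position = null means "continue from the current file position".
    while ((bytesRead = fs.readSync(fd, buffer, 0, chunkSize, null)) > 0) {
      // write() holds back an incomplete multi-byte character until
      // the rest of its bytes arrive with the next chunk.
      const text = decoder.write(buffer.subarray(0, bytesRead));
      if (text) yield text;
    }
    const rest = decoder.end(); // flush whatever is left at end of file
    if (rest) yield rest;
  } finally {
    fs.closeSync(fd);
  }
}
```

A caller can then use an ordinary for...of loop over readTextChunksSync(path), with no callbacks or promises involved.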

Using sed on a compressed file

I have written a file-processing program, and now it needs to read from a gzipped file (the .gz file may be as large as 2 TB when unzipped).
Is there a sed equivalent for compressed files (like zcat for cat), or otherwise what would be the best approach to do the following efficiently?
ONE=`zcat filename.gz| sed -n $counts`
$counts : counter to read(line by line)
The above method works, but it is quite slow for large files, as I need to read each line and perform matching on certain fields.
Thanks
EDIT
Though not directly helpful, here are a set of zcommands
http://www.cyberciti.biz/tips/decompress-and-expand-text-files.html
Well you either can have more speed (i.e. use uncompressed files) or more free space (i.e. use compressed files and the pipe you showed)... sorry. Using compressed files will always have an overhead.
If you understand the internal structure of the compression format it is possible that you could write a pattern matcher that can operate on compressed data without fully decompressing it, but instead by simply determining from the compressed data if the pattern would be present in a given piece of decompressed data.
If the pattern has any complexity at all this sounds like quite a complicated project as you'd have to handle cases where the pattern could be satisfied by the combination of output from two (or more) separate pieces of decompression.
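Separately from the compression overhead: if the command in the question is executed once per line number, every run decompresses the file again from the start. Decompressing once and handling each line as it streams past avoids that. A minimal sketch in Node.js (not sed), assuming the per-line matching can be written as a callback; forEachGzipLine and onLine are hypothetical names:

```javascript
const fs = require('node:fs');
const readline = require('node:readline');
const zlib = require('node:zlib');

// Hypothetical helper: decompresses gzPath once and calls onLine(line, lineNumber)
// for every line. Returns the total number of lines seen.
async function forEachGzipLine(gzPath, onLine) {
  // Sketch only: errors on the file stream are not forwarded by pipe().
  const input = fs.createReadStream(gzPath).pipe(zlib.createGunzip());
  const rl = readline.createInterface({ input, crlfDelay: Infinity });

  let lineNumber = 0;
  for await (const line of rl) {
    lineNumber += 1;
    await onLine(line, lineNumber); // field matching would go here
  }
  return lineNumber;
}
```

The same single-pass idea applies in the shell: pipe the decompressed output once into a loop or a single awk/sed program instead of re-running the pipeline for every line.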

Resources