NodeJS - Stream a large ASCII file from S3 with Hex Characters (NUL) - node.js

I am trying to read (via streaming) a large file in a Lambda function. My goal is just to read the first few lines and look for some information. The input file in S3 seems to contain hex characters (NUL), and the following code stops reading the line when it hits the NUL character and goes to the next line. I would like to know how I can read the whole line and replace/remove the NUL character before I look for the information in the line. Here is the code that does not work as expected:
var readline = require('line-reader');
var readStream = s3.getObject({Bucket: S3Bucket, Key: fileName}).createReadStream();
readline.eachLine(readStream, {separator: '\n', encoding: 'utf8'}, function(line) {
  console.log('Line ', line);
});

As mentioned by Brad, it's hard to help, as this is more an issue with your line-reader lib.
I can, however, offer an alternative solution that doesn't require the lib.
I would use GetObject as you are, but I would also specify a value for the Range parameter, then work my way through the file in chunks, and stop reading chunks once I am satisfied.
If the chunk I read doesn't contain a \n, read another chunk; keep going until I get a \n, then read from the start of my buffered data up to the \n, set the new starting position based on the position of the \n, and read a new chunk from that position if you want more data.
Check out the Range parameter in the API:
http://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/S3.html#getObject-property
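A minimal sketch of that approach (untested against your bucket; the chunk size and the NUL-stripping step are only illustrative, and it assumes the AWS SDK v2 client you're already using):

const AWS = require('aws-sdk');
const s3 = new AWS.S3();

// Read the first line of an S3 object without downloading the whole file,
// stripping NUL characters as we go. Assumes an ASCII/UTF-8 text file.
async function readFirstLine(bucket, key, chunkSize = 64 * 1024) {
  let buffered = '';
  let start = 0;

  while (true) {
    // Fetch only the byte range needed for this iteration.
    const { Body } = await s3.getObject({
      Bucket: bucket,
      Key: key,
      Range: `bytes=${start}-${start + chunkSize - 1}`
    }).promise();

    // Remove NUL characters before searching for the line break.
    buffered += Body.toString('utf8').replace(/\0/g, '');

    const nl = buffered.indexOf('\n');
    if (nl !== -1) return buffered.slice(0, nl);

    // A short read means we reached the end of the object without a newline.
    if (Body.length < chunkSize) return buffered;

    start += chunkSize;
  }
}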

Related

Node.js "readline" + "fs. createReadStream" : Specify start & end line number

The Node.js readline docs (https://nodejs.org/api/readline.html)
provide this solution for reading large files like CSVs line by line:
const { once } = require('events');
const { createReadStream } = require('fs');
const { createInterface } = require('readline');

(async function processLineByLine() {
  try {
    const rl = createInterface({
      input: createReadStream('big-file.txt'),
      crlfDelay: Infinity
    });

    rl.on('line', (line) => {
      // Process the line.
    });

    await once(rl, 'close');
    console.log('File processed.');
  } catch (err) {
    console.error(err);
  }
})();
But I don't want to read the entire file from beginning to end but parts of it say from line number 1 to 10000, 20000 to 30000, etc.
Basically I want to be able to set a 'start' & 'end' line for a given run of my function.
Is this doable with readline & fs.createReadStream?
If not, please suggest an alternate approach.
PS: It's a large file (around 1 GB) & loading it in memory causes memory issues.
But I don't want to read the entire file from beginning to end but parts of it say from line number 1 to 10000, 20000 to 30000, etc.
Unless your lines are of fixed, identical length, there is NO way to know where line 10,000 starts without reading from the beginning of the file and counting lines until you get to line 10,000. That's how text files with variable length lines work. Lines in the file are not physical structures that the file system knows anything about. To the file system, the file is just a gigantic blob of data. The concept of lines is something we invent at a higher level and thus the file system or OS knows nothing about lines. The only way to know where lines are is to read the data and "parse" it into lines by searching for line delimiters. So, line 10,000 is only found by searching for the 10,000th line delimiter starting from the beginning of the file and counting.
There is no way around it, unless you preprocess the file into a more efficient format (like a database) or create an index of line positions.
Basically I want to be able to set a 'start' & 'end' line for a given run of my function.
The only way to do that is to "index" the data ahead of time so you already know where each line starts/ends. Some text editors made to handle very large files do this. They read through the file (perhaps lazily) reading every line and build an in-memory index of what file offset each line starts at. Then, they can retrieve specific blocks of lines by consulting the index and reading that set of data from the file.
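A rough sketch of such an index in Node, assuming LF line endings and a placeholder file name; it makes one streaming pass to record where every line starts, after which any block of lines can be fetched with a plain byte-range read:

const fs = require('fs');
const readline = require('readline');

// One streaming pass: record the byte offset at which each line starts.
async function buildLineIndex(path) {
  const offsets = [0];
  let offset = 0;

  const rl = readline.createInterface({
    input: fs.createReadStream(path),
    crlfDelay: Infinity
  });

  for await (const line of rl) {
    // +1 for the '\n' that readline stripped (assumes LF-only line endings).
    offset += Buffer.byteLength(line, 'utf8') + 1;
    offsets.push(offset);
  }
  return offsets;
}

// Later: stream only lines [startLine, endLine) via a byte-range read.
function createLineRangeStream(path, offsets, startLine, endLine) {
  return fs.createReadStream(path, {
    start: offsets[startLine],
    end: offsets[endLine] - 1   // 'end' is inclusive
  });
}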
Is this doable with readline & fs.createReadStream?
Without fixed length lines, there's no way to know where in the file line 10,000 starts without counting from the beginning.
It's a large file (around 1 GB) & loading it in memory causes memory issues.
Streaming the file a line at a time with the line-reader module, or others that do something similar, will handle the memory issue just fine: only a block of data from the file is in memory at any given time. You can handle arbitrarily large files even on a small-memory system this way.
A new line is just a character (or two characters if you're on Windows); you have no way of knowing where those characters are without processing the file.
You are however able to read only a certain byte range in a file. If you know for a fact that every line contains 64 bytes, you can skip the first 100 lines by starting your read at byte 6400, and you can read only 100 lines by stopping your read at byte 12800.
Details on how to specify start and end points are available in the createReadStream docs.
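For the fixed-length case, a small sketch; the 64-byte line length matches the example above, and the file name is a placeholder:

const fs = require('fs');

const LINE_LENGTH = 64;   // assumed fixed length of every line, including '\n'
const skipLines = 100;    // skip the first 100 lines
const readLines = 100;    // then read the next 100 lines

// 'start' and 'end' are inclusive byte offsets.
const stream = fs.createReadStream('big-file.txt', {
  start: skipLines * LINE_LENGTH,                    // byte 6400
  end: (skipLines + readLines) * LINE_LENGTH - 1     // byte 12799
});

stream.on('data', chunk => process.stdout.write(chunk));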

Converting a nodejs buffer to string and back to buffer gives a different result in some cases

I created a .docx file.
Now, I do this:
// read the file to a buffer
const data = await fs.promises.readFile('<pathToMy.docx>')
// Converts the buffer to a string using 'utf8' but we could use any encoding
const stringContent = data.toString()
// Converts the string back to a buffer using the same encoding
const newData = Buffer.from(stringContent)
// We expect the values to be equal...
console.log(data.equals(newData)) // -> false
I don't understand in what step of the process the bytes are being changed...
I already spent sooo much time trying to figure this out, without any result... If someone can help me understand what part I'm missing out, it would be really awesome!
A .docx file is not a UTF-8 string (it's a binary ZIP file), so when you read it into a Buffer object and then call .toString() on it, you're assuming it is already encoded as UTF-8 in the buffer and you now want to move it into a Javascript string. That's not what you have. Your binary data will likely contain sequences that are invalid in UTF-8, and those will be discarded or coerced into valid UTF-8, causing an irreversible change.
What Buffer.toString() does is take a Buffer that is ALREADY encoded in UTF-8 and put it into a Javascript string. See this note in the docs:
If encoding is 'utf8' and a byte sequence in the input is not valid UTF-8, then each invalid byte is replaced with the replacement character U+FFFD.
So, the code you show in your question is wrongly assuming that Buffer.toString() takes binary data and reversibly encodes it as a UTF8 string. That is not what it does and that's why it doesn't do what you are expecting.
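A tiny demonstration of that replacement; the 0xFF byte is just one example of a sequence that isn't valid UTF-8:

const original = Buffer.from([0x68, 0x69, 0xff]);         // "hi" + an invalid UTF-8 byte
const roundTripped = Buffer.from(original.toString('utf8'));

console.log(roundTripped);                  // <Buffer 68 69 ef bf bd> - 0xFF became U+FFFD
console.log(original.equals(roundTripped)); // false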
Your question doesn't describe what you're actually trying to accomplish. If you want to do something useful with the .docx file, you probably need to actually parse it from its binary ZIP form into the actual components of the file in their appropriate format.
Now that you explain you're trying to store it in localStorage, you need to encode the binary into a string format. One popular option is Base64; it isn't super efficient (size-wise), but it is better than many others. See Binary Data in JSON String. Something better than Base64 for prior discussion on this topic. Ignore the notes about compression in that other answer, because your data is already ZIP-compressed.
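A sketch of that Base64 round trip (the path is the same placeholder as in the question); unlike the UTF-8 conversion, it is lossless, so it's safe for stashing binary data in localStorage:

const fs = require('fs');

const data = fs.readFileSync('<pathToMy.docx>');

// Base64 maps raw bytes to safe ASCII, so the conversion is exactly reversible.
const stringContent = data.toString('base64');        // store this string
const newData = Buffer.from(stringContent, 'base64'); // decode it later

console.log(data.equals(newData)); // -> true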

Nodejs fs.readfile vs new buffer binary

I have a situation where I receive a base64 encoded image, decode it, then want to use it in some analysis activity.
I can use Buffer to go from base64 to binary, but I seem to be unable to use that output as expected (as an image).
The solution now is to convert to binary, write it to a file, then read that file again. The FS output can be used as an image, but this approach seems inefficient and adds extra steps, as I would expect the buffer output to also be a usable image since it has the same data?
My question is: how does the fs.readFile output differ from the buffer output? And is there a way I can use the buffer output as I would the fs output?
Buffer from a base64 string:
var bin = new Buffer(base64String, 'base64').toString('binary');
Read a file:
var bin = fs.readFileSync('image.jpg');
Many thanks
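For comparison, a small sketch (with image.jpg standing in for the received image and the Base64 payload simulated locally): decoding the Base64 straight to a Buffer, without the .toString('binary') step, yields the same bytes that fs.readFileSync returns.

const fs = require('fs');

// Simulate the incoming payload: the same image, Base64 encoded.
const base64String = fs.readFileSync('image.jpg').toString('base64');

// Decode it and keep it as a Buffer (no .toString('binary')).
const fromBase64 = Buffer.from(base64String, 'base64');

// fs.readFileSync without an encoding also returns a Buffer.
const fromFile = fs.readFileSync('image.jpg');

console.log(fromBase64.equals(fromFile)); // -> true: identical bytes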

Read First line from a file stored in Azure Cloud Blob Storage

I'm trying to read the first line of the file, which is stored in an Azure Storage blob container. The code snippet below is the standard way to read a file to the end and write out its content:
foreach (IListBlobItem item in container.ListBlobs(null, false))
{
    if (item.GetType() == typeof(CloudBlockBlob))
    {
        CloudBlockBlob blob = (CloudBlockBlob)item;
        using (var stream = blob.OpenRead())
        {
            using (StreamReader reader = new StreamReader(stream))
            {
                while (!reader.EndOfStream)
                {
                    Console.WriteLine(reader.ReadLine().First());
                    //Console.WriteLine(reader.ReadLine());
                }
            }
        }
    }
}
I want the first line of the file. But I cannot use "while (!reader.EndOfStream)" as it reads the whole file and then writes to the console line by line.
Also, I cannot load the whole file, as the file size is more than 3 GB.
How do I get hold of only the first line from the file stored in Azure Blob Storage?
Azure Blob Storage supports reading byte ranges. So you don't really need to download the entire blob to read just the first line in the file. The method you would want to use is CloudBlob.DownloadRangeToByteArray.
Let's assume that the lines in the blob are separated by a line feed (\n, character code 10). With this assumption, here's what you would need to do:
You could choose to progressively read a single byte at a time, starting from the 0th byte, in a loop. You store each byte you read in some kind of byte buffer, and you continue reading until you encounter the line feed character. As soon as you encounter it, you break out of the loop; whatever you have in the buffer is your first line.
Instead of reading a single byte, you could also read a larger byte range (say 1024 bytes, or maybe larger). Once you have these bytes, you look for the new line character in the downloaded range. If you find it, you slice the array at the index of that character, and that is your first line. If you don't find it, you put the fetched data in some kind of buffer and read the next 1K bytes. You keep doing this until you encounter the new line character; once you find it, you use the buffer plus the last set of bytes received, and that is your first line.
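A rough, untested C# sketch of that second approach, using the same Microsoft.WindowsAzure.Storage SDK the question's CloudBlockBlob comes from; the 1 KB chunk size is arbitrary, error handling is omitted, and it assumes UTF-8/ASCII text:

using System;
using System.Text;
using Microsoft.WindowsAzure.Storage.Blob;

public static class BlobFirstLine
{
    public static string Read(CloudBlockBlob blob, int chunkSize = 1024)
    {
        blob.FetchAttributes();                  // populate blob.Properties.Length
        long blobLength = blob.Properties.Length;

        var firstLine = new StringBuilder();
        long offset = 0;

        while (offset < blobLength)
        {
            int bytesToRead = (int)Math.Min(chunkSize, blobLength - offset);
            var buffer = new byte[bytesToRead];

            // Download only this byte range instead of the whole 3 GB blob.
            blob.DownloadRangeToByteArray(buffer, 0, offset, bytesToRead);

            int newLineIndex = Array.IndexOf(buffer, (byte)'\n');
            if (newLineIndex >= 0)
            {
                // Found the line feed: keep everything before it and stop.
                firstLine.Append(Encoding.UTF8.GetString(buffer, 0, newLineIndex));
                break;
            }

            firstLine.Append(Encoding.UTF8.GetString(buffer));
            offset += bytesToRead;
        }

        return firstLine.ToString();
    }
}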
Azure Storage blobs are not the same as local file objects. If you want to do specific parsing of a blob, you need to copy it locally first, and then open it as a proper file. And yes, for a 3GB file, given the 60MB/s per-blob transfer rate, this could take some time. (so you might want to consider storing parts of the blob, such as the first line, in a secondary, searchable storage area, for these purposes).
Although a year late, and I have not tried it: instead of using
while (!reader.EndOfStream)
{
    Console.WriteLine(reader.ReadLine().First());
}
have you tried this:
if (!reader.EndOfStream)
{
    Console.WriteLine(reader.ReadLine());
}

Using Node.js readline in transform streams

We have a huge text file which we want to manipulate, line by line, using streams.
Is there a way to use the Node.js readline module in a transform stream? For instance, to convert the whole text to upper case (processing it line by line)?
event-stream might be a better fit. It can split the input on lines and transform those lines in various ways (+ more).
For instance, to uppercase everything read from stdin:
const es = require('event-stream');
process.stdin
  .pipe(es.split())                             // split lines
  .pipe(es.mapSync(data => data.toUpperCase())) // uppercase the line
  .pipe(es.join('\n'))                          // add a newline again
  .pipe(process.stdout);                        // write to stdout
