How to convert the accumulated data of an 'fflate' zip stream to a Blob for download - browser

The application generates several GB of data. Storing everything in an array and passing it to zip() worked only up to a limit, at which point the heap filled up. So my plan is:
convert each line using ZipDeflate.push()
store the data from ondata into an array
convert that array into a Blob that can be saved as a zip file
Findings:
data is a Uint8Array
new Blob([mergedDatas]) was not a valid zip file
import { Zip, ZipDeflate } from 'fflate';

const zippedData = [];
const zip = new Zip();
zip.ondata = (err, data, final) => {
  if (!err) {
    zippedData.push(data);
    if (final) {
      // Here I want the zipped data to be downloaded
    }
  }
};
const frames = new ZipDeflate(exportName + '.lsf', {
  level: 9,
});
zip.add(frames);
frames.push(header); // this is Uint8Array binary data
// there are lots of loops that generate GBs of data, e.g. the two lines below
frames.push(numberToBytes(branch.name));
frames.push(ColorByteArray);
// end of loop
frames.push(new Uint8Array(0), true); // final chunk for this entry
zip.end();

I finally found the answer myself:
let processedZipStreams = [];
zip.ondata = async (err, dat, final) => {
  if (err) {
    // handle the error
  } else {
    processedZipStreams.push(dat);
    generatorInstance.next();
    if (final) {
      SaveAs(
        new Blob(processedZipStreams, { type: 'application/octet-stream' }),
        exportName,
        true
      );
      callback(100);
      const duration = performance.now() - start;
      console.log(duration / 1000);
    }
  }
};
The processedZipStreams chunks can be packed into a Blob and downloaded.
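To tie the answer together, here is a minimal, self-contained sketch of the same pattern: collect the Uint8Array chunks that fflate's Zip emits and pack them into a Blob once the final chunk arrives. The entry name and payload are placeholders, and the download is triggered through an object URL instead of the SaveAs helper used above:

import { Zip, ZipDeflate } from 'fflate';

function downloadZip(name, chunks) {
  // Pack the collected Uint8Array chunks into a Blob and trigger a download.
  const blob = new Blob(chunks, { type: 'application/zip' });
  const url = URL.createObjectURL(blob);
  const a = document.createElement('a');
  a.href = url;
  a.download = name;
  a.click();
  URL.revokeObjectURL(url);
}

const chunks = [];
const zip = new Zip();
zip.ondata = (err, chunk, final) => {
  if (err) throw err;
  chunks.push(chunk);
  if (final) downloadZip('export.zip', chunks);
};

const entry = new ZipDeflate('export.lsf', { level: 9 });
zip.add(entry);
entry.push(new TextEncoder().encode('placeholder payload')); // stand-in for the generated data
entry.push(new Uint8Array(0), true); // mark this entry as finished
zip.end();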

Related

nodejs - Async generator/iterator with or without awaiting long operation

I'm trying to understand which setup is best for doing the following operations:
Read a CSV file line by line
Use the row data as input to a complex function that eventually outputs a file (one file for each row)
When the entire process is finished, zip all the files generated during step 2
My goal: a fast and scalable solution able to handle huge files
I've implemented step 2 using two approaches, and I'd like to know which is best and why (or if there are other, better ways)
Step 1
This part is simple; I rely on the CSV Parser async iterator API:
const fs = require("fs");
const { parse } = require("csv-parse"); // the CSV parser mentioned above

async function* loadCsvFile(filepath, params = {}) {
  try {
    const parameters = {
      ...csvParametersDefault,
      ...params,
    };
    const inputStream = fs.createReadStream(filepath);
    const csvParser = parse(parameters);
    const parser = inputStream.pipe(csvParser);
    for await (const line of parser) {
      yield line;
    }
  } catch (err) {
    throw new Error("error while reading csv file: " + err.message);
  }
}
Step 2
Option 1
Await the long operation handleCsvLine for each line:
// step 1
const csvIterator = loadCsvFile(filePath, options);
// step 2
let counter = 0;
for await (const row of csvIterator) {
  await handleCvsLine(
    row,
  );
  counter++;
  if (counter % 50 === 0) {
    logger.debug(`Processed label ${counter}`);
  }
}
// step 3
zipFolder(folderPath);
Pro
nice to see the files being generated one after the other
since it waits for each operation to end, I can show progress nicely
Cons
it waits for each operation; can it be faster?
Option 2
Push the promises from the long operation handleCsvLine into an array and then, after the loop, do Promise.all:
// step 1
const csvIterator = loadCsvFile(filePath, options);
// step 2
let counter = 0;
const promises = [];
for await (const row of csvIterator) {
  promises.push(handleCvsLine(row));
  counter++;
  if (counter % 50 === 0) {
    logger.debug(`Processed label ${counter}`);
  }
}
await Promise.all(promises);
// step 3
zipFolder(folderPath);
Pro
I do not wait, so it should be faster, shouldn't it?
Cons
since it does not wait, the for loop finishes very quickly, but then there is a long wait at the end (i.e. a bad progress experience); a batching compromise is sketched below
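A possible middle ground, sketched here as my own suggestion rather than something from the question, is to process the rows in fixed-size batches: you keep some parallelism inside each batch but can still report progress between batches. It reuses handleCvsLine and logger from the snippets above; the batch size of 20 is arbitrary:

async function processInBatches(csvIterator, batchSize = 20) {
  let batch = [];
  let counter = 0;
  for await (const row of csvIterator) {
    batch.push(handleCvsLine(row)); // start the work, don't await yet
    if (batch.length === batchSize) {
      await Promise.all(batch); // wait for the whole batch before moving on
      counter += batch.length;
      batch = [];
      logger.debug(`Processed label ${counter}`);
    }
  }
  await Promise.all(batch); // flush the final, possibly partial, batch
}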
Step 3
A simple step in which I use the archiver library to create a zip of the folder in which I saved the files from step 2:
function zipFolder(folderPath, globPath, outputFolder, outputName, logger) {
  return new Promise((resolve, reject) => {
    // create a file to stream archive data to.
    const stream = fs.createWriteStream(path.join(outputFolder, outputName));
    const archive = archiver("zip", {
      zlib: { level: 9 }, // Sets the compression level.
    });
    archive.glob(globPath, { cwd: folderPath });
    // good practice to catch warnings (ie stat failures and other non-blocking errors)
    archive.on("warning", function (err) {
      if (err.code === "ENOENT") {
        logger.warning(err);
      } else {
        logger.error(err);
        reject(err);
      }
    });
    // good practice to catch this error explicitly
    archive.on("error", function (err) {
      logger.error(err);
      reject(err);
    });
    // pipe archive data to the file
    archive.pipe(stream);
    // listen for all archive data to be written
    // 'close' event is fired only when a file descriptor is involved
    stream.on("close", function () {
      resolve();
    });
    archive.finalize();
  });
}
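For reference, given that signature the call in step 3 would presumably need all five arguments rather than just folderPath; something along these lines, where the glob pattern and output name are placeholders:

// step 3, matching the zipFolder signature above
await zipFolder(folderPath, "**/*", outputFolder, "output.zip", logger);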
Not using await does not make the operations faster. It just means you don't wait for the response before moving on to the next operation; the work is still added to the event queue either way.
You should use child_process instead to approximate parallel processing. Node.js is not multithreaded, but you can achieve parallelism with child_process, which lets work run on separate CPU cores. This way, you can generate multiple files at a time based on the number of CPU cores available in the system.
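As a rough sketch of that idea (assuming a hypothetical worker.js that receives one row over IPC, generates its file, and replies with a message when it is done):

const { fork } = require("child_process");
const os = require("os");

// Fan rows out to one child process per CPU core; each child handles one row at a time.
async function processRowsInParallel(rows) {
  const workers = Array.from({ length: os.cpus().length }, () => fork("./worker.js"));
  let next = 0;

  await Promise.all(workers.map((worker) => new Promise((resolve) => {
    const sendNext = () => {
      if (next >= rows.length) {
        worker.kill();
        return resolve();
      }
      worker.send(rows[next++]); // hypothetical worker.js replies once the file is written
    };
    worker.on("message", sendNext);
    sendNext();
  })));
}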

Node Read Streams - How can I limit the number of open files?

I'm running into AggregateError: EMFILE: too many open files while streaming multiple files.
Machine Details:
MacOS Monterey,
MacBook Pro (14-inch, 2021),
Chip Apple M1 Pro,
Memory 16GB,
Node v16.13.0
I've tried increasing the limits with no luck.
Ideally I would like to be able to set the limit of the number of files open at one time or resolve by closing files as soon as they have been used.
Code below. I've tried to remove the unrelated code and replace it with '//...'.
const MultiStream = require('multistream');
const fs = require('fs-extra'); // Also tried graceful-fs and the standard fs
const { fdir } = require("fdir");
// Also have a require for the bz2 and split2 functions but editing from phone right now
//...
let files = [];
//...
(async () => {
  const crawler = await new fdir()
    .filter((path, isDirectory) => path.endsWith(".bz2"))
    .withFullPaths()
    .crawl("Dir/Sub Dir")
    .withPromise();
  for (const file of crawler) {
    files = [...files, fs.createReadStream(file)]
  }
  multi = await new MultiStream(files)
    // Unzip
    .pipe(bz2())
    // Create chunks from lines
    .pipe(split2())
    .on('data', function (obj) {
      // Code to filter data and extract what I need
      //...
    })
    .on("error", function (error) {
      // Handling parsing errors
      //...
    })
    .on('end', function (error) {
      // Output results
      //...
    })
})();
To prevent pre-opening a filehandle for every single file in your array, you want to only open the files upon demand when it's that particular file's turn to be streamed. And, you can do that with multi-stream.
Per the multi-stream doc, you can lazily create the readStreams by changing this:
for (const file of crawler) {
  files = [...files, fs.createReadStream(file)]
}
to this:
let files = crawler.map((f) => {
  return function () {
    return fs.createReadStream(f);
  }
});
After reading over the npm page for multistream I think I have found something that will help. I have also edited where you are adding the stream to the files array as I don't see a need to instantiate a new array and spread existing elements like you are doing.
To lazily create the streams, wrap them in a function:
var streams = [
  fs.createReadStream(__dirname + '/numbers/1.txt'),
  function () { // will be executed when the stream is active
    return fs.createReadStream(__dirname + '/numbers/2.txt')
  },
  function () { // same
    return fs.createReadStream(__dirname + '/numbers/3.txt')
  }
]
new MultiStream(streams).pipe(process.stdout) // => 123
With that, we can update your logic by simply wrapping the readStreams in functions; this way the streams are not created until they are needed, which prevents you from having too many open at once. We can do this by updating your file loop:
for (const file of crawler) {
  files.push(function () {
    return fs.createReadStream(file)
  })
}
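Putting the two pieces together, a condensed sketch of the whole pipeline with lazily created streams could look like this (the requires for bz2 and split2 are omitted here, just as in the question):

const MultiStream = require('multistream');
const fs = require('fs-extra');
const { fdir } = require("fdir");

(async () => {
  const paths = await new fdir()
    .filter((path, isDirectory) => path.endsWith(".bz2"))
    .withFullPaths()
    .crawl("Dir/Sub Dir")
    .withPromise();

  // Each entry is a factory, so a file handle is only opened when MultiStream reaches that file.
  const files = paths.map((p) => () => fs.createReadStream(p));

  new MultiStream(files)
    .pipe(bz2())    // unzip
    .pipe(split2()) // create chunks from lines
    .on('data', (obj) => { /* filter data and extract what is needed */ })
    .on('error', (error) => { /* handle parsing errors */ })
    .on('end', () => { /* output results */ });
})();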

SSE returned data is 50x larger than the file size

We have a hapijs server running which returns an SSE stream containing rows returned from a readable SQL Server (NOT MySQL) stored procedure.
We observed 50,000 rows returned in chunks of 1,000 rows per event. Both Chrome and Firefox indicated that the size of the data returned in the request was just under 250 MB. We copied the returned data (JSON) into a .txt file and noticed that it was only 20 MB.
The content encoding is set to identity. We implemented SUSIE and are using that to return the event source.
let result = await server.methods.services.worklist.getData(request.query);
let rowsToProcess = [];
let chunkSizeSmall = 50; // #todo move this to the config
let chunkSizeLarge = 1000; // #todo move this to the config
let chunkSize = chunkSizeSmall;

result.on('recordset', columns => {
  h.event({ event: 'columns', data: columns })
});

result.on('row', row => {
  rowsToProcess.push(row);
  if (rowsToProcess.length >= chunkSize) {
    chunkSize = chunkSizeLarge;
    result.pause();
    processRows();
  }
});

result.on('done', () => {
  processRows();
  h.event(null)
});

function processRows() {
  // process rows
  h.event({ event: 'data', data: rowsToProcess })
  rowsToProcess = [];
  result.resume();
}

return h.event({});
We are expecting the output to be the same size as, or close to, the file size of the JSON.

Nodejs Read very large file(~10GB), Process line by line then write to other file

I have a 10 GB log file in a particular format. I want to process this file line by line and then write the output to another file after applying some transformations. I am using Node for this operation.
Though this method works, it takes a very long time. I was able to do this within 30-45 minutes in Java, but in Node it takes more than 160 minutes to do the same job. Following is the code:
Following is the initiation code which reads each line from the input.
var path = '../10GB_input_file.txt';
var output_file = '../output.txt';

function fileopsmain() {
  fs.exists(output_file, function (exists) {
    if (exists) {
      fs.unlink(output_file, function (err) {
        if (err) throw err;
        console.log('successfully deleted ' + output_file);
      });
    }
  });
  new lazy(fs.createReadStream(path, {bufferSize: 128 * 4096}))
    .lines
    .forEach(function (line) {
      var line_arr = line.toString().split(';');
      perform_line_ops(line_arr, line_arr[6], line_arr[7], line_arr[10]);
    });
}
This is the method that performs some operations on that line and passes the result to the write method, which writes it into the output file.
function perform_line_ops(line_arr, range_start, range_end, daynums) {
  var _new_lines = '';
  for (var i = 0; i < days; i++) {
    // perform some operation to modify line, pass it to print
  }
  write_line_ops(_new_lines);
}
The following method is used to write data into a new file.
function write_line_ops(line) {
  if (line != null && line != '') {
    fs.appendFileSync(output_file, line);
  }
}
I want to bring this time down to 15-20 minutes. Is it possible to do so?
Also, for the record, I'm trying this on an Intel i7 processor with 8 GB of RAM.
You can do this easily without a module. For example:
var fs = require('fs');
var inspect = require('util').inspect;

var buffer = '';
var rs = fs.createReadStream('foo.log');
rs.on('data', function (chunk) {
  var lines = (buffer + chunk).split(/\r?\n/g);
  buffer = lines.pop();
  for (var i = 0; i < lines.length; ++i) {
    // do something with `lines[i]`
    console.log('found line: ' + inspect(lines[i]));
  }
});
rs.on('end', function () {
  // optionally process `buffer` here if you want to treat leftover data without
  // a newline as a "line"
  console.log('ended on non-empty buffer: ' + inspect(buffer));
});
I can't guess where the possible bottleneck is in your code.
Can you add the library or the source code of the lazy function?
How many operations does your perform_line_ops do? (if/else, switch/case, function calls)
I've created an example based on your given code. I know this does not answer your question, but maybe it helps you understand how Node handles such a case.
const fs = require('fs')
const path = require('path')

const inputFile = path.resolve(__dirname, '../input_file.txt')
const outputFile = path.resolve(__dirname, '../output_file.txt')

function bootstrap() {
  // fs.exists is deprecated
  // check if output file exists
  // https://nodejs.org/api/fs.html#fs_fs_exists_path_callback
  fs.exists(outputFile, (exists) => {
    if (exists) {
      // output file exists, delete it
      // https://nodejs.org/api/fs.html#fs_fs_unlink_path_callback
      fs.unlink(outputFile, (err) => {
        if (err) {
          throw err
        }
        console.info(`successfully deleted: ${outputFile}`)
        checkInputFile()
      })
    } else {
      // output file doesn't exist, move on
      checkInputFile()
    }
  })
}

function checkInputFile() {
  // check if input file can be read
  // https://nodejs.org/api/fs.html#fs_fs_access_path_mode_callback
  fs.access(inputFile, fs.constants.R_OK, (err) => {
    if (err) {
      // file can't be read, throw error
      throw err
    }
    // file can be read, move on
    loadInputFile()
  })
}

function saveToOutput() {
  // create write stream
  // https://nodejs.org/api/fs.html#fs_fs_createwritestream_path_options
  const stream = fs.createWriteStream(outputFile, {
    flags: 'w'
  })
  // return wrapper function which simply writes data into the stream
  return (data) => {
    // check if the stream is writable
    if (stream.writable) {
      if (data === null) {
        stream.end()
      } else if (data instanceof Array) {
        stream.write(data.join('\n'))
      } else {
        stream.write(data)
      }
    }
  }
}
function parseLine(line, respond) {
  respond([line])
}

function loadInputFile() {
  // create write stream
  const saveOutput = saveToOutput()
  // create read stream
  // https://nodejs.org/api/fs.html#fs_fs_createreadstream_path_options
  const stream = fs.createReadStream(inputFile, {
    autoClose: true,
    encoding: 'utf8',
    flags: 'r'
  })
  let buffer = null
  stream.on('data', (chunk) => {
    // append the buffer to the current chunk
    const lines = (buffer !== null)
      ? (buffer + chunk).split('\n')
      : chunk.split('\n')
    const lineLength = lines.length
    let lineIndex = -1
    // save last line for later (last line can be incomplete)
    buffer = lines[lineLength - 1]
    // loop through all lines
    // but don't include the last line
    while (++lineIndex < lineLength - 1) {
      parseLine(lines[lineIndex], saveOutput)
    }
  })
  stream.on('end', () => {
    if (buffer !== null && buffer.length > 0) {
      // parse the last line
      parseLine(buffer, saveOutput)
    }
    // Passing null signals the end of the stream (EOF)
    saveOutput(null)
  })
}

// kick off the parsing process
bootstrap()
I know this is old but...
At a guess, appendFileSync() writes to the file system and waits for the response. Lots of small writes are generally expensive; presuming you used a BufferedWriter in Java, you might get faster results by skipping some writes.
Use one of the async writes and see if Node buffers sensibly, or write the lines to a large Node Buffer until it is full and always write a full (or nearly full) Buffer. By tuning the buffer size you could check whether the number of writes affects performance. I suspect it would.
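A minimal sketch of that buffering idea, assuming the output_file path from the question and an arbitrary 64 KB flush threshold (the answer does not specify one):

const fs = require('fs');

// Collect lines in memory and flush them through a single write stream
// instead of calling appendFileSync() once per line.
const out = fs.createWriteStream(output_file, { flags: 'a' });
let pending = [];
let pendingBytes = 0;

function write_line_buffered(line) {
  pending.push(line);
  pendingBytes += Buffer.byteLength(line);
  if (pendingBytes >= 64 * 1024) { // flush roughly every 64 KB
    out.write(pending.join(''));
    pending = [];
    pendingBytes = 0;
  }
}

function flush_and_close() {
  if (pending.length > 0) out.write(pending.join(''));
  out.end();
}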
The execution is slow because you're not using Node's asynchronous operations. In essence, you're executing the code like this:
> read some lines
> transform
> write some lines
> repeat
You could instead be doing everything at once, or at least reading and writing in parallel. Some examples in the answers here do that, but the syntax is rather complicated. Using scramjet you can do it in a couple of simple lines:
const {StringStream} = require('scramjet');

fs.createReadStream(path, {bufferSize: 128 * 4096})
  .pipe(new StringStream({maxParallel: 128})) // I assume this is a utf-8 file
  .split("\n")                                // split per line
  .parse((line) => line.split(';'))           // parse line
  .map(([line_arr, range_start, range_end, daynums]) => {
    return simplyReturnYourResultForTheOtherFileHere(
      line_arr, range_start, range_end, daynums
    ); // run your code, return a promise if you're doing some async work
  })
  .stringify((result) => result.toString())
  .pipe(fs.createWriteStream(output_file))
  .on("finish", () => console.log("done"))
  .on("error", (e) => console.log("error"))
This will probably run much faster.

Office task pane app: how to get the whole document in an OOXml string?

I'm developing an Office Task Pane app that needs to access the whole document. I know there is an API getFileAsync()
https://msdn.microsoft.com/en-us/library/office/jj220084.aspx
Office.context.document.getFileAsync(fileType [, options], callback);
However, the fileType can take only three values: compressed, pdf, text.
compressed
Returns the entire document (.pptx or .docx) in Office Open XML (OOXML) format as a byte array.
pdf
Returns the entire document in PDF format as a byte array.
text
Returns only the text of the document as a string. (Word only)
When it is compressed, the returned value is a byte array.
How can I get an OOXml string?
Or is there an API to select all content in a document so that I can use the getSelectedDataAsync() API?
In case someone finds this thread, I managed to solve this using zip.js.
var dataByteArray = [];

function getDocumentAsOoxml() {
  Office.context.document.getFileAsync("compressed", { sliceSize: 100000 }, function (result) {
    if (result.status == Office.AsyncResultStatus.Succeeded) {
      // Get the File object from the result.
      var myFile = result.value;
      var state = {
        file: myFile,
        counter: 0,
        sliceCount: myFile.sliceCount
      };
      getSlice(state);
    }
  });
}

function getSlice(state) {
  state.file.getSliceAsync(state.counter, function (result) {
    if (result.status == Office.AsyncResultStatus.Succeeded) {
      readSlice(result.value, state);
    }
  });
}

function readSlice(slice, state) {
  var data = slice.data;
  // If the slice contains data, create an HTTP request.
  if (data) {
    dataByteArray = dataByteArray.concat(data);
    state.counter++;
    if (state.counter < state.sliceCount) {
      getSlice(state);
    } else {
      closeFile(state);
    }
  }
}
function closeFile(state) {
  // Close the file when you're done with it.
  state.file.closeAsync(function (result) { });
  // convert from byte array to blob that can be read by zip.js
  var byteArray = new Uint8Array(dataByteArray);
  var blob = new Blob([byteArray]);
  // Load zip.js library
  $.getScript("/Scripts/zip.js/zip.js", function () {
    zip.workerScriptsPath = "/Scripts/zip.js/";
    // use a BlobReader to read the zip from a Blob object
    zip.createReader(new zip.BlobReader(blob), function (reader) {
      // get all entries from the zip file
      reader.getEntries(function (entries) {
        if (entries.length > 0) {
          for (var i = 0; i < entries.length; i++) {
            var entry = entries[i];
            // find the file you are looking for
            if (entry.filename == 'word/document.xml') {
              entry.getData(new zip.TextWriter(), function (text) {
                // text contains the entry data as a String
                doSomethingWithText(text);
                // close the zip reader
                reader.close(function () {
                  // onclose callback
                });
              }, function (current, total) {
                // onprogress callback
              });
              break;
            }
          }
        }
      });
    }, function (error) {
      // onerror callback
    });
  });
}
Hopefully there will be an easier way in the future.
This is a little late.
I've been working with Task Pane apps lately and, as it turns out, OOXML is natively compressed (unless I'm grossly mistaken).
My best advice would be to figure out which encoding the string is encoded with and decode using that encoding. I'm willing to bet that it's UTF-8.
