Creating an empty file of a certain size? - node.js

How can we create an empty file of a certain size? I have a requirement where I need to create an empty file of a given size (i.e. a file filled with zero bytes). In order to do so, this is the approach I am currently taking:
I create an empty file of zero bytes.
Next, I keep appending a zero-filled buffer (at most 2 GB at a time) to that file until I reach the desired size.
Here's the code I am using currently:
const fs = require('fs');
const buffer = require('buffer'); // needed for buffer.kMaxLength (largest Buffer Node allows)

const createEmptyFileOfSize = (fileName, size) => {
  return new Promise((resolve, reject) => {
    try {
      // First create an empty file.
      fs.writeFile(fileName, Buffer.alloc(0), (error) => {
        if (error) {
          reject(error);
        } else {
          let sizeRemaining = size;
          do {
            // Append at most buffer.kMaxLength zero-filled bytes per iteration.
            const chunkSize = Math.min(sizeRemaining, buffer.kMaxLength);
            const dataBuffer = Buffer.alloc(chunkSize);
            try {
              fs.appendFileSync(fileName, dataBuffer);
              sizeRemaining -= chunkSize;
            } catch (error) {
              reject(error);
              return; // stop appending once a write fails
            }
          } while (sizeRemaining > 0);
          resolve(true);
        }
      });
    } catch (error) {
      reject(error);
    }
  });
};
While this code works and I am able to create very large files, it takes significant time (roughly 5 seconds to create a 10 GB file), so I am wondering if there's a better way of accomplishing this.

There is no need to write the whole content of the file. All you have to do is open the file for writing ('w' means "create if it doesn't exist, truncate if it exists") and write at least one byte at the offset you need. If the offset is larger than the current size of the file (when it exists), the file is extended to accommodate the new offset.
Your code should be as simple as this:
const fs = require('fs');
const createEmptyFileOfSize = (fileName, size) => {
  return new Promise((resolve, reject) => {
    const fh = fs.openSync(fileName, 'w');
    fs.writeSync(fh, 'ok', Math.max(0, size - 2));
    fs.closeSync(fh);
    resolve(true);
  });
};
// Create a file of 1 GiB
createEmptyFileOfSize('./1.txt', 1024*1024*1024);
Please note that the code above doesn't handle errors. It was written to show a use case. Your real production code should handle the errors (and reject the promise, of course).
Read more about fs.openSync(), fs.writeSync() and fs.closeSync().
Update
A Promise should do its processing asynchronously; the executor function passed to the constructor should end as soon as possible, leaving the Promise in the pending state. Later, when the processing completes, the Promise will use the resolve or reject callbacks passed to the executor to change its state.
The complete code, with error handling and the correct creation of a Promise could be:
const fs = require('fs');

const createEmptyFileOfSize = (fileName, size) => {
  return new Promise((resolve, reject) => {
    // Check size
    if (size < 0) {
      reject("Error: a negative size doesn't make any sense");
      return;
    }
    // Will do the processing asynchronously
    setTimeout(() => {
      try {
        // Open the file for writing; 'w' creates the file
        // (if it doesn't exist) or truncates it (if it exists)
        const fd = fs.openSync(fileName, 'w');
        if (size > 0) {
          // Write one byte (with code 0) at the desired offset
          // This forces the expanding of the file and fills the gap
          // with characters with code 0
          fs.writeSync(fd, Buffer.alloc(1), 0, 1, size - 1);
        }
        // Close the file to commit the changes to the file system
        fs.closeSync(fd);
        // Promise fulfilled
        resolve(true);
      } catch (error) {
        // Promise rejected
        reject(error);
      }
      // Create the file after the processing of the current JavaScript event loop
    }, 0);
  });
};
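For completeness, here is a minimal usage sketch of the helper above (the path and size are just example values):

// Example only: create a 10 GiB file and report the outcome
createEmptyFileOfSize('./10gib.bin', 10 * 1024 * 1024 * 1024)
  .then(() => console.log('file created'))
  .catch((error) => console.error('could not create file:', error));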

Related

nodejs - Async generator/iterator with or without awaiting long operation

I'm trying to understand which setup is the best for doing the following operations:
Read line by line a CSV file
Use the row data as input of a complex function that at the end outputs a file (one file for each row)
When the entire process is finished I need to zip all the files generated during step 2
My goal: a fast and scalable solution able to handle huge files
I've implemented step 2 using two approaches, and I'd like to know which is the best and why (or if there are other better ways)
Step 1
This is simple, and I rely on the CSV parser's async iterator API:
const fs = require("fs");
const { parse } = require("csv-parse"); // assumed: the csv-parse package's async iterator API

async function* loadCsvFile(filepath, params = {}) {
  try {
    const parameters = {
      ...csvParametersDefault, // defaults defined elsewhere in the project
      ...params,
    };
    const inputStream = fs.createReadStream(filepath);
    const csvParser = parse(parameters);
    const parser = inputStream.pipe(csvParser);
    for await (const line of parser) {
      yield line;
    }
  } catch (err) {
    throw new Error("error while reading csv file: " + err.message);
  }
}
Step 2
Option 1
Await the long operation handleCsvLine for each line:
// step 1
const csvIterator = loadCsvFile(filePath, options);
// step 2
let counter = 0;
for await (const row of csvIterator) {
  await handleCvsLine(row);
  counter++;
  if (counter % 50 === 0) {
    logger.debug(`Processed label ${counter}`);
  }
}
// step 3
zipFolder(folderPath);
Pro
nice to see the files being generated one after the other
since it waits for each operation to end, I can show the progress nicely
Cons
it waits for each operation; could it be faster?
Option 2
Push the long operation handleCsvLine into an array and then, after the loop, do Promise.all:
// step 1
const csvIterator = loadCsvFile(filePath, options);
// step 2
let counter = 0;
const promises = [];
for await (const row of csvIterator) {
  promises.push(handleCvsLine(row));
  counter++;
  if (counter % 50 === 0) {
    logger.debug(`Processed label ${counter}`);
  }
}
await Promise.all(promises);
// step 3
zipFolder(folderPath);
Pro
I do not wait, so it should be faster, shouldn't it?
Cons
since it does not wait, the for loop is very fast, but then there is a long wait at the end (i.e., a bad progress experience); a batched middle ground is sketched below
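Purely as an illustration (not part of the original post), a middle ground between the two options is to process the rows in fixed-size batches: each batch runs its handleCvsLine calls concurrently with Promise.all, and progress can still be reported after every batch, as in Option 1. This sketch assumes handleCvsLine returns a promise; batchSize is an arbitrary example value.

// Hypothetical batched variant of step 2 (sketch only)
const csvIterator = loadCsvFile(filePath, options);
const batchSize = 50; // arbitrary example value
let batch = [];
let counter = 0;
for await (const row of csvIterator) {
  batch.push(handleCvsLine(row));
  if (batch.length === batchSize) {
    await Promise.all(batch); // wait for the whole batch before starting the next one
    counter += batch.length;
    logger.debug(`Processed label ${counter}`);
    batch = [];
  }
}
if (batch.length > 0) {
  await Promise.all(batch); // flush the last, possibly smaller batch
  counter += batch.length;
  logger.debug(`Processed label ${counter}`);
}
zipFolder(folderPath);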
Step 3
A simple step in which I use the archiver library to create a zip of the folder in which I saved the files from step 2:
const fs = require("fs");
const path = require("path");
const archiver = require("archiver");

function zipFolder(folderPath, globPath, outputFolder, outputName, logger) {
  return new Promise((resolve, reject) => {
    // create a file to stream archive data to.
    const stream = fs.createWriteStream(path.join(outputFolder, outputName));
    const archive = archiver("zip", {
      zlib: { level: 9 }, // Sets the compression level.
    });
    archive.glob(globPath, { cwd: folderPath });
    // good practice to catch warnings (ie stat failures and other non-blocking errors)
    archive.on("warning", function (err) {
      if (err.code === "ENOENT") {
        logger.warning(err);
      } else {
        logger.error(err);
        reject(err);
      }
    });
    // good practice to catch this error explicitly
    archive.on("error", function (err) {
      logger.error(err);
      reject(err);
    });
    // pipe archive data to the file
    archive.pipe(stream);
    // listen for all archive data to be written
    // 'close' event is fired only when a file descriptor is involved
    stream.on("close", function () {
      resolve();
    });
    archive.finalize();
  });
}
Not using await does not make the operations themselves faster. The loop simply doesn't wait for each response before moving on to the next operation; the same work still gets added to the event queue, with or without await.
To get parallel processing you should use child_process instead. Node.js runs your JavaScript on a single thread, but with child_process you can spawn separate processes that run on other CPU cores. This way, you can generate multiple files at a time based on the number of CPU cores available in the system.
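A rough sketch of that child_process suggestion (illustrative only, not code from this answer): the rows from step 1 could be collected and handed to a small pool of forked workers, one per CPU core. The worker script './csv-row-worker.js' and its message protocol are hypothetical.

// Illustrative sketch only: fan rows out to forked workers, one per CPU core.
// './csv-row-worker.js' is a hypothetical script that listens with process.on('message', ...),
// writes the file for the row it receives, and replies with process.send('done').
const { fork } = require("child_process");
const os = require("os");

async function processWithWorkers(rows) {
  const workerCount = os.cpus().length;
  const workers = Array.from({ length: workerCount }, () => fork("./csv-row-worker.js"));
  let next = 0;

  await Promise.all(workers.map((worker) => new Promise((resolve) => {
    const sendNext = () => {
      if (next >= rows.length) {
        worker.kill(); // no more rows: shut this worker down
        resolve();
        return;
      }
      worker.send(rows[next++]); // hand the next row to this worker
    };
    worker.on("message", sendNext); // worker replies when it has finished a row
    sendNext();                     // prime the worker with its first row
  })));
}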

Nodejs write multiple dynamically changing files with fs writefile

I need to write multiple dynamically changing files based on an array passed to a custom writeData() function. The array consists of objects containing the file name and the data to write, as shown below:
[
  {
    file_name: "example.json",
    dataObj,
  },
  {
    file_name: "example2.json",
    dataObj,
  },
  {
    file_name: "example3.json",
    dataObj,
  },
  {
    file_name: "example4.json",
    dataObj,
  },
];
My current method is to map this array and read + write new data to each file:
array.map((entry) => {
  fs.readFile(entry.file_name, "utf8", (err, unparsedData) => {
    if (err) console.log(err);
    else {
      var parsedData = JSON.parse(unparsedData);
      parsedData.data.push(entry.dataObj);
      const parsedDataJSON = JSON.stringify(parsedData, null, 2);
      fs.writeFile(entry.file_name, parsedDataJSON, "utf8", (err) => {
        if (err) console.log(err);
      });
    }
  });
});
This, however, does not work. Only a small percentage of the data is written to these files, and often the file is not correctly written in JSON format (I think this is because two writeFile processes are writing to the same file at once, which breaks the file). Obviously this does not work the way I expected it to.
The ways I have tried to resolve this problem consisted of attempting to make fs.writeFile synchronous (delaying the map loop so each process finishes before moving to the next entry), but this is not good practice because synchronous processes hang up the entire app. I have also looked into implementing promises, but to no avail. I am a new learner of Node.js, so apologies for any missed details/information. Any help is appreciated!
The same file is often listed multiple times in the array if that changes anything.
Well, that changes everything. You should have shown that in the original question. If that is the case, then you have to sequence each individual file in the loop so it finishes one before advancing to the next. To prevent conflicts between writing to the same file, you have to assure yourself of two things:
You sequence each of the files in the loop so the next one doesn't start until the previous one is done.
You don't call this code again while it's still in operation.
You can assure yourself of the first item like this:
async function processFiles(array) {
  for (let entry of array) {
    const unparsedData = await fs.promises.readFile(entry.file_name, "utf8");
    const parsedData = JSON.parse(unparsedData);
    parsedData.data.push(entry.dataObj);
    const json = JSON.stringify(parsedData, null, 2);
    await fs.promises.writeFile(entry.file_name, json, "utf8");
  }
}
This will abort the loop if it gets an error on any of them. If you want it to continue to write the others, you can add a try/catch internally:
async function processFiles(array) {
  let firstError;
  for (let entry of array) {
    try {
      const unparsedData = await fs.promises.readFile(entry.file_name, "utf8");
      const parsedData = JSON.parse(unparsedData);
      parsedData.data.push(entry.dataObj);
      const json = JSON.stringify(parsedData, null, 2);
      await fs.promises.writeFile(entry.file_name, json, "utf8");
    } catch (e) {
      // log error and continue with the rest of the loop
      if (!firstError) {
        firstError = e;
      }
      console.log(e);
    }
  }
  // make sure we communicate back any error that happened
  if (firstError) {
    throw firstError;
  }
}
To assure yourself of the second point above, you will have to either not use setInterval() (replace it with a setTimeout() that you schedule when the promise returned by processFiles() resolves), or make absolutely sure that the setInterval() interval is long enough that it will never fire before processFiles() is done.
Also, make absolutely sure that you are not modifying the array used in this function while that function is running.
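Here is a minimal sketch of the setTimeout() approach mentioned above, assuming a hypothetical getNextBatch() that builds the array of entries and an example delay of 5 seconds:

// Illustrative only: re-arm the timer after each run finishes,
// so a new run can never overlap the previous one.
const intervalMs = 5000; // example delay

async function runPeriodically() {
  try {
    const array = getNextBatch();  // hypothetical: however you build the entries
    await processFiles(array);     // finishes completely before we schedule again
  } catch (e) {
    console.log(e);
  } finally {
    setTimeout(runPeriodically, intervalMs);
  }
}

runPeriodically();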

Node.js - want 5 parallel calls to a method in a loop

I have 1000 files of information in a MongoDB collection. I am writing a query to fetch 1000 records and, in a loop, I am calling a function to download each file to the local system. So it's a sequential process to download all 1000 files.
I want some parallelism in the downloading process. In the loop, I want to download 10 files at a time, meaning I want to call the download function 10 times; after those 10 downloads complete, I want to download the next 10 files (calling the download function another 10 times).
How can I achieve this parallelism, or is there a better way to do this?
I saw the Kue npm package, but how do I achieve this with it? By the way, I am downloading from FTP, so I am using the basic-ftp npm package for FTP operations.
The async library is very powerful for this, and quite easy too once you understand the basics.
I'd suggest that you use eachLimit so your app won't have to worry about looping through in batches of ten; it will just keep ten files downloading at the same time.
var async = require('async');

var files = ['a.txt', 'b.txt'];
var concurrency = 10;

async.eachLimit(files, concurrency, downloadFile, onFinish);

function downloadFile(file, callback) {
  // run your download code here
  // when file has downloaded, call callback(null)
  // if there is an error, call callback('error code')
}

function onFinish(err, results) {
  if (err) {
    // do something with the error
  }
  // reaching this point means the files have all downloaded
}
The async library will run downloadFile in parallel, sending each instance an entry from the files list, then when every item in the list has completed it will call onFinish.
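Since the question mentions basic-ftp, downloadFile could look roughly like the sketch below. This is only an illustration: the host, credentials, and paths are placeholders, and because a single basic-ftp Client performs one transfer at a time, each concurrent download gets its own client here.

// Illustrative sketch only: adapt basic-ftp's promise API to the callback
// signature that async.eachLimit expects. Host, credentials and paths are placeholders.
const ftp = require('basic-ftp');

function downloadFile(file, callback) {
  const client = new ftp.Client();
  client.access({
    host: 'ftp.example.com', // placeholder connection details
    user: 'user',
    password: 'password',
  })
    .then(() => client.downloadTo(file, file)) // local destination and remote path both use the list entry here
    .then(() => callback(null))                // success: tell eachLimit this file is done
    .catch((err) => callback(err))             // failure: pass the error to eachLimit
    .finally(() => client.close());            // always release the FTP connection
}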
Without seeing your implementation I can only provide a generic answer.
Let's say that your download function receives one fileId and returns a promise that resolves when said file has finished downloading. For this POC, I will mock that up with a promise that will resolve to the file name after 200 to 500 ms.
function download(fileindex) {
  return new Promise((resolve, reject) => {
    setTimeout(() => {
      resolve(`file_${fileindex}`);
    }, 200 + 300 * Math.random());
  });
}
You have 1000 files and want to download them in 100 iterations of 10 files each.
Let's encapsulate things. I'll declare a generator that receives a bucket number and a size, and yields the ids belonging to that bucket:
function* range(bucket, size = 10) {
  let start = bucket * size,
      end = start + size;
  for (let i = start; i < end; i++) {
    yield i;
  }
}
You should create 100 "buckets" containing a reference to 10 files each.
let buckets = [...range(0, 100)].map(bucket => {
  return [...range(bucket, 10)];
});
At this point, the contents of buckets are:
[
  [file_0 ... file_9]
  ...
  [file_990 ... file_999]
]
Then iterate over your buckets using for...of (which is async-capable).
On each iteration, use Promise.all to enqueue 10 calls to download:
async function proceed() {
  for await (let bucket of buckets) {
    await Promise.all(bucket.reduce((accum, fileindex) => {
      accum.push(download(fileindex));
      return accum;
    }, []));
  }
}
let's see a running example (just 10 buckets, we're all busy here :D )
function download(fileindex) {
  return new Promise((resolve, reject) => {
    let file = `file_${fileindex}`;
    setTimeout(() => {
      resolve(file);
    }, 200 + 300 * Math.random());
  });
}

function* range(bucket, size = 10) {
  let start = bucket * size,
      end = start + size;
  for (let i = start; i < end; i++) {
    yield i;
  }
}

let buckets = [...range(0, 10)].map(bucket => {
  return [...range(bucket, 10)];
});

async function proceed() {
  let bucketNumber = 0,
      timeStart = performance.now();
  for await (let bucket of buckets) {
    let startingTime = Number((performance.now() - timeStart) / 1000).toFixed(1).substr(-5),
        result = await Promise.all(bucket.reduce((accum, fileindex) => {
          accum.push(download(fileindex));
          return accum;
        }, []));
    console.log(
      `${startingTime}s downloading bucket ${bucketNumber}`
    );
    await result;
    let endingTime = Number((performance.now() - timeStart) / 1000).toFixed(1).substr(-5);
    console.log(
      `${endingTime}s bucket ${bucketNumber++} complete:`,
      `[${result[0]} ... ${result.pop()}]`
    );
  }
}

document.querySelector('#proceed').addEventListener('click', proceed);

<button id="proceed">Proceed</button>

How to see broken file writes with node.js

My English is not so good, so I hope to be clear.
Just for the sake of curiosity, I want to test multiple concurrent(*) writes to the same file and see them produce errors.
The manual is clear on that:
Note that it is unsafe to use fs.writeFile multiple times on the same
file without waiting for the callback. For this scenario,
fs.createWriteStream is strongly recommended.
So if I write a relatively big amount of data to a file and, while it is still writing, I call another write on the same file without waiting for the callback, I expect some sort of error.
I tried to write a small example to test this situation, but I can't manage to see any errors.
"use strict";
const fs = require('fs');
const writeToFile = (filename, data) => new Promise((resolve, reject) => {
fs.writeFile(filename, data, { flag: 'a' }, err => {
if (err) {
return reject(err);
}
return resolve();
});
});
const getChars = (num, char) => {
let result = '';
for (let i = 0; i < num; ++i) {
result += char;
}
return result + '\n';
};
const k = 10000000;
const data0 = getChars(k, 0);
const data1 = getChars(k, 1);
writeToFile('test1', data0)
.then(() => console.log('0 written'))
.catch(e => console.log('errors in write 0'));
writeToFile('test1', data1)
.then(() => console.log('1 written'))
.catch(e => console.log('errors in write 1'));
To test it, instead of opening the file with an editor (which is a little bit slow on my box), I use a Linux command to look at the end of the first buffer and the beginning of the second buffer (and to check that they do not overlap):
tail -c 10000010 test1 | grep 0
But I'm not sure it is the right way to test it.
Just to be clear, I'm on Node v6.2.2 and macOS 10.11.6.
Can anyone point me to a small sketch that uses fs.writeFile and produces wrong output?
(*) concurrent = don't wait for the end of one file write to begin the next one

Nodejs Read very large file(~10GB), Process line by line then write to other file

I have a 10 GB log file in a particular format. I want to process this file line by line and then write the output to another file after applying some transformations. I am using Node for this operation.
Though this method works, it takes a lot of time. I was able to do this within 30-45 minutes in Java, but in Node it is taking more than 160 minutes to do the same job. Following is the code:
Following is the initialization code, which reads each line from the input.
var path = '../10GB_input_file.txt';
var output_file = '../output.txt';

function fileopsmain() {
  fs.exists(output_file, function (exists) {
    if (exists) {
      fs.unlink(output_file, function (err) {
        if (err) throw err;
        console.log('successfully deleted ' + output_file);
      });
    }
  });
  new lazy(fs.createReadStream(path, {bufferSize: 128 * 4096}))
    .lines
    .forEach(function (line) {
      var line_arr = line.toString().split(';');
      perform_line_ops(line_arr, line_arr[6], line_arr[7], line_arr[10]);
    });
}
This is the method that performs some operation on each line and passes the result to the write method, which writes it into the output file.
function perform_line_ops(line_arr, range_start, range_end, daynums) {
  var _new_lines = '';
  for (var i = 0; i < days; i++) {
    // perform some operation to modify line, pass it to print
  }
  write_line_ops(_new_lines);
}
The following method is used to write data into the new file.
function write_line_ops(line) {
  if (line != null && line != '') {
    fs.appendFileSync(output_file, line);
  }
}
I want to bring this time down to 15-20 minutes. Is it possible to do so?
Also, for the record, I'm trying this on an Intel i7 processor with 8 GB of RAM.
You can do this easily without a module. For example:
var fs = require('fs');
var inspect = require('util').inspect;

var buffer = '';
var rs = fs.createReadStream('foo.log');

rs.on('data', function (chunk) {
  var lines = (buffer + chunk).split(/\r?\n/g);
  buffer = lines.pop();
  for (var i = 0; i < lines.length; ++i) {
    // do something with `lines[i]`
    console.log('found line: ' + inspect(lines[i]));
  }
});

rs.on('end', function () {
  // optionally process `buffer` here if you want to treat leftover data without
  // a newline as a "line"
  console.log('ended on non-empty buffer: ' + inspect(buffer));
});
I can't guess where the possible bottleneck is in your code.
Can you add the library or the source code of the lazy function?
How many operations does your perform_line_ops do? (if/else, switch/case, function calls)
I've created an example based on your code. I know that this does not answer your question, but maybe it helps you understand how Node handles such a case.
const fs = require('fs')
const path = require('path')

const inputFile = path.resolve(__dirname, '../input_file.txt')
const outputFile = path.resolve(__dirname, '../output_file.txt')

function bootstrap() {
  // fs.exists is deprecated
  // check if output file exists
  // https://nodejs.org/api/fs.html#fs_fs_exists_path_callback
  fs.exists(outputFile, (exists) => {
    if (exists) {
      // output file exists, delete it
      // https://nodejs.org/api/fs.html#fs_fs_unlink_path_callback
      fs.unlink(outputFile, (err) => {
        if (err) {
          throw err
        }
        console.info(`successfully deleted: ${outputFile}`)
        checkInputFile()
      })
    } else {
      // output file doesn't exist, move on
      checkInputFile()
    }
  })
}

function checkInputFile() {
  // check if input file can be read
  // https://nodejs.org/api/fs.html#fs_fs_access_path_mode_callback
  fs.access(inputFile, fs.constants.R_OK, (err) => {
    if (err) {
      // file can't be read, throw error
      throw err
    }
    // file can be read, move on
    loadInputFile()
  })
}

function saveToOutput() {
  // create write stream
  // https://nodejs.org/api/fs.html#fs_fs_createwritestream_path_options
  const stream = fs.createWriteStream(outputFile, {
    flags: 'w'
  })
  // return wrapper function which simply writes data into the stream
  return (data) => {
    // check if the stream is writable
    if (stream.writable) {
      if (data === null) {
        stream.end()
      } else if (data instanceof Array) {
        stream.write(data.join('\n'))
      } else {
        stream.write(data)
      }
    }
  }
}

function parseLine(line, respond) {
  respond([line])
}

function loadInputFile() {
  // create write stream
  const saveOutput = saveToOutput()
  // create read stream
  // https://nodejs.org/api/fs.html#fs_fs_createreadstream_path_options
  const stream = fs.createReadStream(inputFile, {
    autoClose: true,
    encoding: 'utf8',
    flags: 'r'
  })
  let buffer = null
  stream.on('data', (chunk) => {
    // append the buffer to the current chunk
    const lines = (buffer !== null)
      ? (buffer + chunk).split('\n')
      : chunk.split('\n')
    const lineLength = lines.length
    let lineIndex = -1
    // save last line for later (last line can be incomplete)
    buffer = lines[lineLength - 1]
    // loop through all lines
    // but don't include the last line
    while (++lineIndex < lineLength - 1) {
      parseLine(lines[lineIndex], saveOutput)
    }
  })
  stream.on('end', () => {
    if (buffer !== null && buffer.length > 0) {
      // parse the last line
      parseLine(buffer, saveOutput)
    }
    // Passing null signals the end of the stream (EOF)
    saveOutput(null)
  })
}

// kick off the parsing process
bootstrap()
I know this is old but...
At a guess, appendFileSync() write()s to the file system and waits for the response. Lots of small writes are generally expensive; presuming you used a BufferedWriter in Java, you might get faster results in Node by skipping some write()s.
Use one of the async writes and see if Node buffers sensibly, or write the lines to a large Node Buffer until it is full and always write a full (or nearly full) Buffer. By tuning the buffer size you could validate whether the number of writes affects performance. I suspect it would.
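A rough sketch of that buffering idea (illustrative only; the 1 MiB threshold is an arbitrary example, and fs and output_file come from the question's code):

// Illustrative sketch: accumulate lines in memory and flush in large chunks
// instead of calling appendFileSync for every single line.
const FLUSH_THRESHOLD = 1024 * 1024; // ~1 MiB, arbitrary example value
let pending = '';

function write_line_ops(line) {
  if (line != null && line != '') {
    pending += line;
    if (pending.length >= FLUSH_THRESHOLD) {
      fs.appendFileSync(output_file, pending); // one big write instead of many small ones
      pending = '';
    }
  }
}

// call once after the input has been fully processed
function flush_remaining() {
  if (pending !== '') {
    fs.appendFileSync(output_file, pending);
    pending = '';
  }
}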
The execution is slow, because you're not using node's asynchronous operations. In essence, you're executing the code like this:
> read some lines
> transform
> write some lines
> repeat
Whereas you could be doing everything at once, or at least reading and writing concurrently. Some examples in the answers here do that, but the syntax is complicated. Using scramjet you can do it in a couple of simple lines:
const fs = require('fs');
const {StringStream} = require('scramjet');

fs.createReadStream(path, {bufferSize: 128 * 4096})
  .pipe(new StringStream({maxParallel: 128})) // I assume this is a utf-8 file
  .split("\n")                                // split per line
  .parse((line) => line.split(';'))           // parse line
  .map(([line_arr, range_start, range_end, daynums]) => {
    return simplyReturnYourResultForTheOtherFileHere(
      line_arr, range_start, range_end, daynums
    ); // run your code, return promise if you're doing some async work
  })
  .stringify((result) => result.toString())
  .pipe(fs.createWriteStream(output_file))
  .on("finish", () => console.log("done"))
  .on("error", (e) => console.log("error"));
This will probably run much faster.
