Running node js export in google cloud function - node.js

We need to export a zip file, containing lots of data (a couple of gb). The zip archive needs to contain about 50-100 indesign files (each about 100mb) and some other smaller files. We try to use google cloud functions to achieve it (less costs etc.) The function is triggered via a config file, which is uploaded into a bucket. The config file contains all information which files needs to be put into the zip. Unfortunately the memory limit of 2gb is always reached, so the function never succeeds.
We tried different things:
First solution was to loop over the files, create promises to download them and after the loop is done we tried to resolve all promises at once. (files are downloaded via streaming directly into a file).
Second try was to await every download inside the for loop, but again, memory limit reached.
So my question is:
Why does node js not clear the streams? It seems like node keeps every streamed file in memory and finally crashes. I already tried to set the readStream and writeStream to null as suggested here:
How to prevent memory leaks in node.js?
But no change.
Note: We never reached the point, there all files are downloaded to create the zip file. It always failed after the first files.
See below the code snippets:
// first try via promises all:
const promises = []
for (const file of files) {
promises.push(downloadIndesignToExternal(file, 'xxx', dir));
}
await Promise.all(promises)
// second try via await every step (not performant in terms of execution time, but we wanted to know if memory limit is also reached:
for (const file of files) {
await downloadIndesignToExternal(file, 'xxx', dir);
}
// code to download indesign file
function downloadIndesignToExternal(activeId, externalId, dir) {
return new Promise((resolve, reject) => {
let readStream = storage.bucket(INDESIGN_BUCKET).file(`${activeId}.indd`).createReadStream()
let writeStream = fs.createWriteStream(`${dir}/${externalId}.indd`);
readStream.pipe(writeStream);
writeStream.on('finish', () => {
resolve();
});
writeStream.on('error', (err) => {
reject('Could not write file');
})
})
}

It's important to know that /tmp (os.tmpdir()) is a memory-based filesystem in Cloud Functions. When you download a file to /tmp, it is taking up memory just as if you had saved it to memory in a buffer.
If your function needs more memory than can be configured for a function, then Cloud Functions might not be the best solution to this problem.
If you still want to use Cloud Functions, you will have to find a way to stream the input files directly to the output file, but without saving any intermediate state in the function. I'm sure this is possible, but you will probably need to write a fair amount of extra code for this.

For anyone interested:
We got it working by streaming the files into the zip and streaming it directly into google cloud storage. Memory usage is now by around 150-300mb, so this works perfectly for us.

Related

Unable to use one readable stream to write to two different targets in Node JS

I have a client side app where users can upload an image. I receive this image in my Node JS app as readable data and then manipulate it before saving like this:
uploadPhoto: async (server, request) => {
try {
const randomString = `${uuidv4()}.jpg`;
const stream = Fse.createWriteStream(`${rootUploadPath}/${userId}/${randomString}`);
const resizer = Sharp()
.resize({
width: 450
});
await data.file
.pipe(resizer)
.pipe(stream);
This works fine, and writes the file to the projects local directory. The problem comes when I try to use the same readable data again in the same async function. Please note, all of this code is in a try block.
const stream2 = Fse.createWriteStream(`${rootUploadPath}/${userId}/thumb_${randomString}`);
const resizer2 = Sharp()
.resize({
width: 45
});
await data.file
.pipe(resizer2)
.pipe(stream2);
The second file is written, but when I check the file, it seems corrupted or didn't successfully write the data. The first image is always fine.
I've tried a few things, and found one method that seems to work but I don't understand why. I add this code just before the I create the second write stream:
data.file.on('end', () => {
console.log('There will be no more data.');
});
Putting the code for the second write stream inside the on-end callback block doesn't make a difference, however, if I leave the code outside of the block, between the first write stream code and the second write stream code, then it works, and both files are successfully written.
It doesn't feel right leaving the code the way it is. Is there a better way I can write the second thumb nail image? I've tried to use the Sharp module to read the file after the first write stream writes the data, and then create a smaller version of it, but it doesn't work. The file doesn't ever seem to be ready to use.
You have 2 alternatives, which depends on how your software is designed.
If possible, I would avoid to execute two transform operations on the same stream in the same "context", eg: an API endpoint. I would rather separate those two different tranform so they do not work on the same input stream.
If that is not possible or would require too many changes, the solution is to fork the input stream and the pipe it into two different Writable. I normally use Highland.js fork for these tasks.
Please also see my comments on how to properly handle streams with async/await to check when the write operation is finished.

How to download a directory's content via ftp using nodejs?

So, i am trying to download the contents of a directory via sftp using nodejs, and so far I am getting stuck with an error.
I am using the ssh2-sftp-client npm package and for the most part it works pretty well as i am able to connect to the server and list the files in a particular remote directory.
Using the fastGet method to download a file also works without any hassles, and since all the methods are promise based i assumed i could easily download all the files in the directory simply enough, by doing something like:
let main = async () => {
await sftp.connect(config.sftp);
let data = await sftp.list(config.remote_dir);
if (data.length) data.map(async x => {
await sftp.fastGet(`${config.remote_dir}/${x.name}`, config.base_path + x.name);
});
}
So it turns out the code above successfully downloads the first file, but then crashes with the following error message:
Error: Failed to get sandbox/demo2.txt: The requested operation cannot be performed because there is a file transfer in progress.
This seems to indicate that the promise from fastGet is resolving too early as the file transfer is supposed to be over when the next element of the file list is processed.
I tried to use the more traditional get() instead but it is using streams, and it fails with a different error. After researching it seems there's been a breaking change regarding streams in node 10.x. well in my case calling get simply fails (not even downloading the first file).
Does anyone know a workaround to this? or else, another package that can download several files by sftp?
Thanks!
I figured out, since the issue was concurrent download attempts on one client connection, i could try to manage it with one client per file download. I ended up with the following recursive function.
let getFromFtp = async (arr) => {
if (arr.length == 0) return (processFiles());
let x = arr.shift();
conns.push(new Client());
let idx = conns.length - 1;
await conns[idx].connect(config.sftp.auth);
await conns[idx]
.fastGet(`${config.sftp.remote_dir}/${x.name}`, `${config.dl_dir}${x.name}`);
await connections[idx].end();
getFromFtp(arr);
};
Notes about this function:
The array parameter is a list of files to download, presumably fetched using list() beforehand
conns was declared as an empty array and is used to contain our clients.
using array.prototype.shift(), to gradually deplete the array as we go through the file list
the processFiles() method is fired once all the files were downloaded.
this is just the POC version. of couse we need to add the error management to that.

Optimal method for nodejs to hand of image from database to browser

the end result that I need is to send multiple images to a web browser from a database.
The images are stored as blobs.
I know I can stream them out of the database and into a file and then I could just give the url to the file.
I also know I can hand off base64 string to the browser so it can render the image.
My question is which option is the most optimal? Or best practice? Keep in mind that if I go the stream method, I would have to check to see if the image has changed since the last time I displayed it...and if it has changed then I have to restream it out of the database.
I have been playing with the oracldb for node js and was able to successfully extract one blob into a file but I am also having trouble streaming multiple files.
This is a two question post:
Which is the most optimal:
1. Send Base64 string - I kind of like this method because i dont have to worry about streaming out the file and checking if it has changed since it is coming straight from the databse. My concern is can the browser/nodejs handle it? I know those strings can be very large. I could also be sending more than one image at a time.
Stream the blobs into files.
The second part question is how can i get multiple blobs out below is my code on streaming just one file, i found this example from github lobstream1.js
https://raw.githubusercontent.com/oracle/node-oracledb/master/examples/lobstream1.js
Focusing on the code:
// Stream a LOB to a file
var dostream = function(lob, cb) {
if (lob.type === oracledb.CLOB) {
console.log('Writing a CLOB to ' + outFileName);
lob.setEncoding('utf8'); // set the encoding so we get a 'string' not a 'buffer'
} else {
console.log('Writing a BLOB to ' + outFileName);
}
var errorHandled = false;
lob.on(
'error',
function(err) {
console.log("lob.on 'error' event");
if (!errorHandled) {
errorHandled = true;
lob.close(function() {
return cb(err);
});
}
});
lob.on(
'end',
function() {
console.log("lob.on 'end' event");
});
lob.on(
'close',
function() {
// console.log("lob.on 'close' event");
if (!errorHandled) {
return cb(null);
}
});
var outStream = fs.createWriteStream(outFileName);
outStream.on(
'error',
function(err) {
console.log("outStream.on 'error' event");
if (!errorHandled) {
errorHandled = true;
lob.close(function() {
return cb(err);
});
}
});
// Switch into flowing mode and push the LOB to the file
lob.pipe(outStream);
};
Fixed spooling out images with this method, I did change the dostream a bit.
for(var x = 0; x<result.rows.length;x++)
{
outputFileName = x + '.jpg';
console.log(outputFileName);
console.log(x);
var lob = result.rows[x][0];
dostream(lob,outputFileName);
// cb(null,lob);
}
Thank you for any help.
Given all the detail you provided in subsequent comments including the average image size, number of distinct images, memory available to Node.js, number of concurrent users, and the fact that it's "very critical to have the images up to date", here's my initial take...
For the first implementation, stick to the KISS principle and avoid over-engineering. Disable browser caching and don't cache images in Node.js. Instead, rely on the driver and Oracle Database to do the heavy lifting for you.
As for the table storing the images, try to use SecureFile LOBs over BasicFile LOBs (they are known to perform better) if possible. Also, look at the caching options available to both (CACHE, CACHE READS, and NOCACHE). Consider enabling the CACHE READS option based on your stated workload, but work with your DBA to ensure the buffer cache is sized appropriately so you will not impact others.
You can rely on the connection pool's connection request queue to help control how many people are fetching files concurrently. In fact, you might want to create a separate pool just for this purpose so that people fetching LOBs aren't blocking people doing other things in the application. For example, let's say you normally have one connection pool with 10 connections. You could create two connection pools with 5 connections each (use the connection pool cache to make this easy). Then, in the code path that fetches lobs, use the lob pool and use the other pool for everything else.
Given this setup, I'd also recommend NOT streaming the LOBs. Using the driver's ability to buffer the LOBs in Node.js will greatly simplify the code and you should have plenty of memory given such a small number of concurrent users/file fetches.
The biggest problem with this scenario that the images are pretty large and they'll always be flowing from the database through Node.js to the browser. But since you'll be on an internal network, this might not be much of a problem. If it does turn out to be a problem, you can start to add caching in either the browser or Node.js based on what makes the most sense.
Unless you do something like tiling or the base64 inline encoding, each image needs its own URL, so each invocation of node-oracledb would return just one image. You could do some kind of caching by writing to disk, but this seems extra IO - you will need to test to measure your own system's performance and memory requirements. Regarding accessing multiple images in node-oracledb there's some code in https://github.com/oracle/node-oracledb/issues/1041#issuecomment-459002641 that may be useful.

How can I stream multiple remote images to a zip file and stream that to browser with ExpressJS?

I've got a small web app built in ExpressJs that allows people in our company to browse product information. A recent feature request requires that users be able to download batches of images (potentially hundreds at a time). These are stored on another server.
Ideally I think I need to to stream the batch of files to a zip file and stream that to the end user's browser as a download. All preferably without having to store the files on the server. The idea being that I want to reduce load on the server as much as possible.
Is it possible to do this or do I need to look at another approach? I've been experimenting with the 'request' module for the initial download.
If anyone can point me in the right direction or recommend any NPM modules that might help it would be very much appreciated.
Thanks.
One useful module for this is archiver, but I'm sure there are others as well.
Here's an example program that shows:
how to retrieve a list of URL's (I'm using async to handle the requests, and also to limit the # of concurrent HTTP requests to 3);
how to add the responses for those URL's to a ZIP file;
to stream the final ZIP file somewhere (in this case to stdout, but in case of Express you can pipe to the response object).
Example:
var async = require('async');
var request = require('request');
var archiver = require('archiver');
function zipURLs(urls, outStream) {
var zipArchive = archiver.create('zip');
async.eachLimit(urls, 3, function(url, done) {
var stream = request.get(url);
stream.on('error', function(err) {
return done(err);
}).on('end', function() {
return done();
});
// Use the last part of the URL as a filename within the ZIP archive.
zipArchive.append(stream, { name : url.replace(/^.*\//, '') });
}, function(err) {
if (err) throw err;
zipArchive.finalize().pipe(outStream);
});
}
zipURLs([
'http://example.com/image1.jpg',
'http://example.com/image2.jpg',
...
], process.stdout);
Do note that although this doesn't require the image files to be locally stored, it does build the ZIP file entirely in memory. Perhaps there are other ZIP modules that would allow you to work around that, although (AFAIK) the ZIP file format isn't really great in terms of streaming, as it depends on metadata being appended to the end of the file.

how to read an incomplete file and wait for new data in nodejs

I have a UDP client that grabs some data from another source and writes it to a file on the server. Since this is large amount of data, I dont want the end user to wait until they its full written to the server so that they can download it. So I made a NodeJS server that grabs the latest data from the file and sends it to the user.
Here is the code:
var stream = fs.readFileSync(filename)
.on("data", function(data) {
response.write(data)
});
The problem here is, if the download starts when the file was only for example 10mb.. the fs.readFileSync will only read my file up to 10mb. Even if 2 mins later the file increased to 100mb. fs.readFileSync will never know about the new updated data. How can I do this in Node? I would like somehow refresh the fs state or maybe perpaps wait for new data using fs file system. Or is there some kind of fs fileContent watcher?
EDIT:
I think the code below describes better what I would like to achieve, however in this code it keeps reading forever and I dont have any variable from fs.read that can help me stop it:
fs.open(filename, 'r', function(err, fd) {
var bufferSize=1000,
chunkSize=512,
buffer=new Buffer(bufferSize),
bytesRead = 0;
while(true){ //check if file has new content inside
fs.read(fd, buffer, 0, chunkSize, bytesRead);
bytesRead+= buffer.length;
}
});
Node has built-in methods in the fs module. It is tagged as unstable, so it can change in the future.
Its called: fs.watchFile(filename[, options], listener)
You can read more about it here: https://nodejs.org/api/fs.html#fs_fs_watchfile_filename_options_listener
But i highly suggest you to use one of the good modules mantained actively like
watchr:
From his readme:
Better file system watching for Node.js. Provides a normalised API the
file watching APIs of different node versions, nested/recursive file
and directory watching, and accurate detailed events for
file/directory changes, deletions and creations.
The module page is here: https://github.com/bevry/watchr
(Used the module in a couple of proyects and working great, im not related to it in other way)
you need store in some data base last size of file.
read filesize first.
load your file.
then make a script to check if file was change.
you can consult the size with jquery.post to obtain your result and decide if need to reload in javascript

Resources