Long-running processes on App Engine with Node.js

I have a Node.js web scraper that might take over an hour to run. It times out when running on the App Engine standard environment. What's the best way to deploy it?
Also, it is triggered to run once per day with cron.yaml, which hits an Express route. Is there a better way to do this?
Here is a simplified snippet of the code. I can run it locally and deploy it to App Engine. It runs fine with a small number of links in dlLinkArray, but with a larger number (thousands) it doesn't seem to do anything. Usage reports show that it runs for a few seconds.
const {Storage} = require('@google-cloud/storage');
const storage = new Storage();

function startDownload(){
  const dlLinkArray = [/* Array of objects with URL and filename, e.g. {link: 'http://source.com', filename: 'file123456'} */]; // About 10,000 links/files
  const promises = [];
  dlLinkArray.forEach(record => { // create array of nested promises
    promises.push(
      uploadFile(bucketName, record.link, record.filename)
        .then((x) => {
          if(x[1].name) // rename file from whatever is on the remote server to a useful ID
            return renameFile(bucketName, x[1].name, record.filename + ".pdf"); // renameFile uses storage.file.move to rename, returns a promise
          else
            return x;
        })
    );
  });
  return Promise.all(promises);
}

function uploadFile(bucketName, fileURL, reName) {
  // Uploads a remote file to the Cloud Storage bucket
  return storage
    .bucket(bucketName)
    .upload(fileURL, {
      gzip: true,
      metadata: {
        cacheControl: 'public, max-age=31536000',
      },
    });
}

/* Express route */
app.get('/api/whatever/download', (req, res) => {
  buckets2.startDownload().then(() => console.log("DONE"));
  res.status(200).send("Download Started");
});

I suspect that the problem is caused by the request deadline. For the App Engine standard environment it is set to 60 seconds by default. However, if you use manual scaling, requests can run for up to 24 hours in the standard environment.
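A minimal app.yaml sketch of what switching to manual scaling could look like (the runtime value and instance count are placeholders, not taken from the question):

runtime: nodejs10   # placeholder; use the runtime your app already targets
manual_scaling:
  instances: 1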

Related

How can I speed up Firebase function requests?

I just migrated my website over to Firebase from localhost and it's working fine; however, my Firebase functions take a pretty significant amount of time to resolve. One of the core features is pulling files from a Google Cloud bucket, which was taking only 3 seconds on localhost and is now taking around three times as long after migrating.
Is there any way for me to speed up my Firebase function query time? If not, then is there a way for me to at least wait for a request to resolve before redirecting to a new page?
Here is the code for pulling the file in case it helps at all.
app.get('/gatherFromStorage/:filepath', async (req, res) => {
  try {
    const {filepath} = req.params;
    const file = bucket.file(filepath)
    let fileDat = []
    const newStream = file.createReadStream()
    newStream.setEncoding('utf8')
      .on('data', function(chunk){
        fileDat.push(chunk)
      })
      .on('end', function() {
        console.log('done')
        res.json(fileDat)
      })
      .on('error', function(err){
        console.log(err)
      });
  } catch(error) {
    res.status(500).send(error)
    console.log(error)
  }
})
Also, this question may come off as silly, but I just don't know the answer. When I create an Express endpoint, should each endpoint be its own Firebase function, or is it fine to wrap all my endpoints into one Firebase function?
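For reference, the "wrap all endpoints into one Firebase function" pattern mentioned above looks roughly like this (a sketch only; the export name api is arbitrary and the route handler is the one from the snippet above):

const functions = require('firebase-functions');
const express = require('express');
const app = express();

// All Express routes live on the same app...
app.get('/gatherFromStorage/:filepath', async (req, res) => {
  // ...handler as in the snippet above...
});

// ...and a single HTTPS function serves every route
exports.api = functions.https.onRequest(app);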

Download file from GCP Storage bucket is very slow with npm library @google-cloud/storage in nodejs/typescript app

I'm downloading files from a GCP Cloud Storage bucket from a NodeJS/Express app with TypeScript, using the official library @google-cloud/storage. I'm running the application locally, inside a Docker image that runs on docker-compose. A standard local environment, I guess.
The problem is that the file download takes a very long time, and I really don't understand why.
In fact, I tried to download files with the GCP REST API (through the media link URL), using a simple fetch request: in this case, everything goes well and the download time is fine.
Below is a download time comparison for a couple of files of different sizes:
1KB: @google-cloud/storage 621 ms, fetch 224 ms
587KB: @google-cloud/storage 4.1 s, fetch 776 ms
28MB: @google-cloud/storage 2 minutes and 4 seconds, fetch 4 s
@google-cloud/storage authentication is managed through the GOOGLE_APPLICATION_CREDENTIALS environment variable. I have the same problems with both the 5.8.5 and 5.14.0 versions of the @google-cloud/storage library.
Precisely, I need to get the file as a buffer in order to directly manage its content in the Node application; the code is below.
import fetch from 'node-fetch'
import { Storage as GoogleCloudStorageLibrary } from '@google-cloud/storage'

export interface GoogleCloudStorageDownload {
  fileBuffer: Buffer;
  fileName: string;
}

// this method takes a long time to retrieve the file and resolve the promise
const downloadBufferFile = async (filePath: string, originalName: string): Promise<GoogleCloudStorageDownload> => {
  const storage = new GoogleCloudStorageLibrary()
  const bucket = storage.bucket('...gcp_cloud_storage_bucket_name...')
  return new Promise<GoogleCloudStorageDownload>((resolve, reject) => {
    bucket
      .file(filePath)
      .download()
      .then((data) => {
        if (Array.isArray(data) && data.length > 0 && data[0]) {
          resolve({ fileBuffer: data[0], fileName: originalName })
        }
      })
      .catch((e) => {
        if (e.code === 404) {
          reject(new Error(`CloudStorageService - ${e.message} at path: ${filePath}`))
        }
        reject(new Error(`Error in downloading file from Google Cloud bucket at path: ${filePath}`))
      })
  })
}

// this method takes a normal time to retrieve the file and resolve
const downloadBufferFileFetch = async (filePath: string, originalName: string): Promise<GoogleCloudStorageDownload> => {
  const fetchParams = {
    headers: {
      Authorization: 'Bearer ...oauth2_bearer_token...'
    }
  }
  const fetchResponse = await fetch(filePath, fetchParams)
  if (!fetchResponse.ok) {
    throw new Error(`Error in fetch request: ${filePath}`)
  }
  const downloadedFile = await fetchResponse.buffer()
  const result = {
    fileBuffer: downloadedFile,
    fileName: originalName
  }
  return result
}

const filePath = '...complete_file_path_at_gcp_bucket...'
const originalName = 'fileName.csv'

const slowResult = await downloadBufferFile(filePath, originalName)
const fastResult = await downloadBufferFileFetch(filePath, originalName)
The bucket has standard configuration.
You may suggest just using the REST API with fetch, but that would be suboptimal and/or annoying, since I would have to manage the Authorization Bearer token and its refresh for each environment the application runs on.
Am I doing something wrong? What could be causing the very/extremely slow download?
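(As a side note on the token-management concern: one possible sketch, which is my own assumption and not taken from the question, is to let google-auth-library obtain and refresh the access token that the fetch call sends, reusing the same GOOGLE_APPLICATION_CREDENTIALS setup.)

const { GoogleAuth } = require('google-auth-library');

// The library reads GOOGLE_APPLICATION_CREDENTIALS and refreshes tokens as needed
const auth = new GoogleAuth({ scopes: 'https://www.googleapis.com/auth/devstorage.read_only' });

async function getBearerToken() {
  const client = await auth.getClient();
  const { token } = await client.getAccessToken();
  if (!token) throw new Error('Could not obtain an access token');
  return token;
}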

Download from source and upload to gcloud using cloud function

I'm using a service called OpenTok, which stores videos on their cloud and gives me a callback URL when a file is ready, so that I can download it and store it with my cloud provider.
We use Google Cloud where I work, and I need to download the file and then store it in my Cloud Storage bucket from a Firebase Cloud Function.
Here is my code:
const archiveFile = await axios.get(
  'https://sample-videos.com/video701/mp4/720/big_buck_bunny_720p_2mb.mp4'
);
console.log('file downloaded from opentokCloud !');

fs.writeFile('archive.mp4', archiveFile, err => {
  if (err) throw err;
  // success case, the file was saved
  console.log('File Saved in container');
});

await firebaseBucket.upload('archive.mp4', {
  gzip: true,
  // destination: `archivedStreams/${archiveInfo.id}/archive.mp4`,
  destination: 'test/lapin.mp4',
  metadata: {
    cacheControl: 'no-cache',
  },
});
I tried to pass the downloaded file directly to upload(), but it does not work; I have to provide a string (the path of my file).
How can I get the path of my downloaded file in the cloud function? Is it still in the RAM of my container, or in a cache folder?
As you can see, I tried to write it with fs, but I have no write access in the container of the cloud function.
Thanks in advance to the community.
If someone is looking for this in the future, here is how I solved it:
With Firebase Cloud Functions you can write temporary files to /tmp (see https://cloud.google.com/functions/docs/concepts/exec#file_system for more information).
I solved the problem by using the node-fetch package and a Node.js write stream:
const fetch = require('node-fetch');
const fs = require('fs');

await fetch(archiveInfo.url).then(res => {
  console.log('start writing data');
  const dest = fs.createWriteStream('/tmp/archive.mp4');
  res.body.pipe(dest);
  // Listen for when the writing of the file is done
  dest.on('finish', async () => {
    console.log('start uploading in gcloud !');
    await firebaseBucket.upload('/tmp/archive.mp4', {
      gzip: true,
      destination: `{pathToFileInBucket}/archive.mp4`,
      metadata: {
        cacheControl: 'no-cache',
      },
    });
    console.log('uploading finished');
  });
});
firebaseBucket is my gcloud bucket, already configured elsewhere; define your own bucket using @google-cloud/storage.
As my function is triggered by a link, don't forget to send a response and catch errors to avoid timed-out cloud functions (running for nothing, billed for nothing :D).
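A minimal sketch of that "always respond and catch errors" shape for an HTTP-triggered function (the helper name downloadAndUpload and the request payload are illustrative, not from the answer above):

const functions = require('firebase-functions');

exports.archive = functions.https.onRequest(async (req, res) => {
  try {
    // downloadAndUpload would wrap the fetch + /tmp write + bucket.upload logic shown above
    await downloadAndUpload(req.body.archiveInfo);
    res.status(200).send('archive stored');
  } catch (err) {
    console.error(err);
    // Respond on failure too, so the function doesn't hang until it times out
    res.status(500).send(err.message);
  }
});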

How to download files to the /tmp folder of a Google Cloud Function and then upload them to Google Cloud Storage

So I need to deploy a Google Cloud Function that allows me to do two things.
The first is to DOWNLOAD any file from an SFTP/FTP server to the local /tmp directory of the Cloud Function. Then, the second step is to UPLOAD this file to a bucket in Google Cloud Storage.
I know how to upload, but I don't get how to DOWNLOAD files from the FTP server to my local /tmp directory.
So I have written a GCF that receives as parameters (in the body) the configuration (config) that allows me to connect to the FTP server, the filename, and the path.
For my test I used the following public SFTP test server: https://www.sftp.net/public-online-sftp-servers with this configuration.
{
  config:
  {
    hostname: 'test.rebex.net',
    username: 'demo',
    port: 22,
    password: 'password'
  },
  filename: 'FtpDownloader.png',
  path: '/pub/example'
}
After my DOWNLOAD, I start my UPLOAD. For that, I check whether the DOWNLOADED file exists at '/tmp/filename' before the UPLOAD, but the file is never there.
See the following code:
exports.transferSFTP = (req, res) =>
{
  let body = req.body;
  if(body.config)
  {
    if(body.filename)
    {
      //DOWNLOAD
      const Client = require('ssh2-sftp-client');
      const fs = require('fs');
      const client = new Client();
      let remotePath
      if(body.path)
        remotePath = body.path + "/" + body.filename;
      else
        remotePath = "/" + body.filename;
      let dst = fs.createWriteStream('/tmp/' + body.filename);
      client.connect(body.config)
        .then(() => {
          console.log("Client is connected !");
          return client.get(remotePath, dst);
        })
        .catch(err =>
        {
          res.status(500);
          res.send(err.message);
        })
        .finally(() => client.end());
      //UPLOAD
      const {Storage} = require('@google-cloud/storage');
      const storage = new Storage({projectId: 'my-project-id'});
      const bucket = storage.bucket('my-bucket-name');
      const file = bucket.file(body.filename);
      fs.stat('/tmp/' + body.filename, (err, stats) =>
      {
        if(stats.isDirectory())
        {
          fs.createReadStream('/tmp/' + body.filename)
            .pipe(file.createWriteStream())
            .on('error', (err) => console.error(err))
            .on('finish', () => console.log('The file upload is completed !!!'));
          console.log("File exist in tmp directory");
          res.status(200).send('Successfully executed !!!')
        }
        else
        {
          console.log("File is not on the tmp Google directory");
          res.status(500).send('File is not loaded in tmp Google directory')
        }
      });
    }
    else res.status(500).send('Error: no filename on the body (filename)');
  }
  else res.status(500).send('Error: no configuration elements on the body (config)');
}
So I receive the following message: "File is not loaded in tmp Google directory", because after the fs.stat() call, stats.isDirectory() is false. Before using the fs.stat() method to check whether the file is there, I had simply been writing files with the same filenames but without content.
So I conclude that my upload works, but without the DOWNLOADED files it is really hard to copy them to Google Cloud Storage.
Thanks for your time and I hope I will find a solution.
The problem is that you're not waiting for the download to complete before the code which performs the upload starts running. While you do have a catch() statement, that is not sufficient.
Think of the first part (the download) as a separate block of code. You have told JavaScript to go off and do that block asynchronously. As soon as your script has done that, it immediately goes on to do the rest of your script. It does not wait for the 'block' to complete. As a result, your code to do the upload is running before the download has been completed.
There are two things you can do. The first would be to move all the code which does the uploading into a 'then' block following the get() call (BTW, you could simplify things by using fastGet()), e.g.
client.connect(body.config)
  .then(() => {
    console.log("Client is connected !");
    return client.fastGet(remotePath, localPath);
  })
  .then(() => {
    // do the upload
  })
  .catch(err => {
    res.status(500);
    res.send(err.message);
  })
  .finally(() => client.end());
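For completeness, the "do the upload" step could reuse the stream-based upload from the question; a possible sketch combining the two (wrapping the stream events in a promise so the chain waits for the upload to finish):

client.connect(body.config)
  .then(() => client.fastGet(remotePath, '/tmp/' + body.filename))
  .then(() => {
    // reuse the question's read-stream -> bucket write-stream upload
    return new Promise((resolve, reject) => {
      fs.createReadStream('/tmp/' + body.filename)
        .pipe(file.createWriteStream())
        .on('error', reject)
        .on('finish', resolve);
    });
  })
  .then(() => res.status(200).send('Successfully executed !!!'))
  .catch(err => {
    res.status(500);
    res.send(err.message);
  })
  .finally(() => client.end());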
The other alternative would be to use async/await, which will make your code look a little more 'synchronous'. Something along the lines of (untested)
async function doTransfer(remotePath, localPath) {
  try {
    let client = new Client();
    await client.connect(config);
    await client.fastGet(remotePath, localPath);
    await client.end();
    uploadFile(localPath);
  } catch(err) {
    ....
  }
}
Here is a GitHub project that addresses an issue similar to yours.
There they deploy a Cloud Function that downloads the files from the FTP server and uploads them directly to the bucket, skipping the step of having a temporary file.
The code works, but the deployment instructions in that GitHub repo are not up to date, so I'll give the deploy steps as I suggest them; I verified they work:
Activate Cloud Shell and run:
Clone the repository from github: git clone https://github.com/RealKinetic/ftp-bucket.git
Change to the directory: cd ftp-bucket
Adapt your code as needed
Create a GCS bucket; if you don't have one already, you can create one with gsutil mb -p [PROJECT_ID] gs://[BUCKET_NAME]
Deploy: gcloud functions deploy importFTP --stage-bucket [BUCKET_NAME] --trigger-http --runtime nodejs8
In my personal experience this is more efficient than splitting it into two functions, unless you need to do some file editing within the same cloud function.
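For illustration, a rough sketch of that direct-streaming idea, piping the SFTP download straight into a Cloud Storage write stream so no /tmp file is needed (my own sketch using ssh2-sftp-client as in the question above, not the code from the linked repository; names are placeholders):

const Client = require('ssh2-sftp-client');
const {Storage} = require('@google-cloud/storage');

async function streamFtpToBucket(config, remotePath, bucketName, destName) {
  const client = new Client();
  const storage = new Storage();
  const dst = storage.bucket(bucketName).file(destName).createWriteStream();
  try {
    await client.connect(config);
    // get() accepts a writable stream as destination, so the data goes straight to GCS;
    // depending on the library version you may also want to wait for dst's 'finish' event
    await client.get(remotePath, dst);
    console.log('streamed ' + remotePath + ' to gs://' + bucketName + '/' + destName);
  } finally {
    await client.end();
  }
}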

This is a general question about Express.js running on Node.js inside a Docker container and in the cloud

I have built two Docker images. One with nginx that serves my Angular web app, and another with Node.js that serves a basic Express app. I have tried to access the Express app from my browser in two different tabs at the same time.
In one tab the Angular dev server (ng serve) serves up the web page. In the other tab the Docker nginx container serves up the web page.
While accessing the Node.js Express app at the same time from both tabs, the data starts to mix and mingle and the results returned to both tabs are a mishmash of the two requests (one from each browser tab)...
I'll try to make this simpler by showing my Express app code here...but to answer this question you may not even need to know what the code is at all...so maybe check the question as stated below the code first.
'use strict';
/***********************************
GOOGLE GMAIL AND OAUTH SETUP
***********************************/
const fs = require('fs');
const {google} = require('googleapis');
const gmail = google.gmail('v1');
const clientSecretJson = JSON.parse(fs.readFileSync('./client_secret.json'));
const oauth2Client = new google.auth.OAuth2(
  clientSecretJson.web.client_id,
  clientSecretJson.web.client_secret,
  'https://us-central1-labelorganizer.cloudfunctions.net/oauth2callback'
);
/***********************************
EXPRESS WITH CORS SETUP
***********************************/
const PORT = 8000;
const HOST = '0.0.0.0';
const express = require('express');
const cors = require('cors');
const cookieParser = require('cookie-parser');
const bodyParser = require('body-parser');
const whiteList = [
  'http://localhost:4200',
  'http://localhost:80',
  'http://localhost',
];
const googleApi = express();
googleApi.use(
  cors({
    origin: whiteList
  }),
  cookieParser(),
  bodyParser()
);
function getPageOfThreads(pageToken, userId, labelIds) {
  return new Promise((resolve, reject) => {
    gmail.users.threads.list(
      {
        'auth': oauth2Client,
        'userId': userId,
        'labelIds': labelIds,
        'pageToken': pageToken
      },
      (error, response) => {
        if (error) {
          console.error(error);
          reject(error);
        }
        resolve(response.data);
      }
    )
  });
}

async function getPages(nextPageToken, userId, labelIds, result) {
  while (nextPageToken) {
    let pageOfThreads = await getPageOfThreads(nextPageToken, userId, labelIds);
    console.log(pageOfThreads.nextPageToken);
    pageOfThreads.threads.forEach((thread) => {
      result = result.concat(thread.id);
    })
    nextPageToken = pageOfThreads.nextPageToken;
  }
  return result;
}
googleApi.post('/threads', (req, res) => {
  console.log(req.body);
  let threadIds = [];
  oauth2Client.credentials = req.body.token;
  let getAllThreadIds = new Promise((resolve, reject) => {
    gmail.users.threads.list(
      { 'auth': oauth2Client, 'userId': 'me', 'maxResults': 500 },
      (err, response) => {
        if (err) {
          console.error(err)
          reject(err);
        }
        if (response.data.threads) {
          response.data.threads.forEach((thread) => {
            threadIds = threadIds.concat(thread.id);
          });
        }
        if (response.data.nextPageToken) {
          getPages(response.data.nextPageToken, 'me', ['INBOX'], threadIds).then(result => {
            resolve(result);
          }).catch((err) => {
            console.error(err);
            reject(err);
          });
        } else {
          resolve(threadIds);
        }
      }
    );
  });
  getAllThreadIds
    .then((result) => {
      res.send({ threadIds: result });
    })
    .catch((error) => {
      res.status(500).send({ error: 'Request failed with error: ' + error })
    });
});

googleApi.get('/', (req, res) => res.send('Hello World!'))

googleApi.listen(PORT, HOST);
console.log(`Running on http://${HOST}:${PORT}`);
The angular app makes a simple request to the express app and waits for the reply...which it properly receives...but when I try to make two requests at the exact same time data starts to get mixed together and results are given back to each browser tab from different accounts...
...and the question is... When running containers in the cloud is this kind of thing an issue? Does one need to spin up a new container for each client that wants to actively connect to the express service so that their data doesn't get mixed?
...or is this an issue I am seeing because the Express app is being accessed locally from inside my machine? If two machines with two different IP addresses tried to access this Express server at the same time, would this sort of data mixing still be an issue, or would each get back its own set of results?
Is this why people use CaaS instead of IaaS solutions?
FYI: this is demo code and the data will not be actually going back to the consumer directly...plans are to have it placed into a database and then re-extracted from the database to download all of the metadata headers for each email.
-Thank you for your time
I can only clear up a small part of this question:
When running containers in the cloud is this kind of thing an issue?
No. Docker is not causing any of the quirky behaviour that you are describing.
Does one need to spin up a new container for each client?
A Docker container can generally serve as many users as the application inside it can. So as long as your application can handle a lot of users (and it should), you don't have to start the same application in multiple containers. That said, when you expect a very large number of customers, there are Docker tools like Docker Compose, Docker Swarm and many alternatives that will enable you to scale up later. For now, you don't need to worry about this at all.
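For example, with Docker Compose you can later run several instances of the same service with something like docker-compose up --scale api=3 (the service name api is illustrative here, not taken from the question).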
I think I may have found out the issue with my code...and this is actually very important if you are using the node.js googleapis client library...
It is entirely necessary to create a new oauth2Client for each request that comes in
const oauth2Client = new google.auth.OAuth2(
  clientSecretJson.web.client_id,
  clientSecretJson.web.client_secret,
  'https://us-central1-labelorganizer.cloudfunctions.net/oauth2callback'
);
Problem:
When this oauth2Client is shared, it is shared by every person that connects at the same time... So it is necessary to create a new one every time a user connects to my /threads endpoint, so that they do not share the same memory space (i.e. access_token etc.) while the processing is done.
Setting the client secret etc. and creating the oauth2Client just once at the top and then simply resetting the token for each request leads to the conflicts mentioned above.
Solution:
For now simply moving the creation of this oauth2Client into each and every request that comes in makes this work properly.
Each client that connects to the service NEEDS to have their own newly created oauth2Client instance or these types of conflicts will occur...
...it's kind of a no-brainer, but I still find it odd that there is nothing about this in the docs, and their own examples (https://github.com/googleapis/google-api-nodejs-client) seem to show only one instance being created for the whole of their app...but those examples are snippets, so...
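A minimal sketch of that per-request construction, based on the route shown above (only the client creation moves; the rest of the handler stays the same):

googleApi.post('/threads', (req, res) => {
  // A fresh client per request, so tokens are never shared between users
  const oauth2Client = new google.auth.OAuth2(
    clientSecretJson.web.client_id,
    clientSecretJson.web.client_secret,
    'https://us-central1-labelorganizer.cloudfunctions.net/oauth2callback'
  );
  oauth2Client.credentials = req.body.token;
  // ...the rest of the /threads handler from above, using this local oauth2Client...
});

Note that getPageOfThreads and getPages would then also need to receive this client as a parameter instead of reading the module-level one.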
