Google Cloud Platform - Optimise Cloud Function using Puppeteer (Node.js)

I have written a function in node.js that works well when I run it locally (~10s to run).
As I want to run it every hour, I have deployed it on Google Cloud Platform. But, there, I always have a TimeOut error.
Therefore, do you have any advice on:
what I should change in my function to make it more efficient?
an alternative way to automate my function so it runs every hour?
FYI my cloud function has the following characteristics:
Node.js 8
Memory: 2 GB
Timeout: 540 seconds
and the following form:
exports.launchSearch = (req, res) => {
  const puppeteer = require('puppeteer');
  const url = require('./pageInformation').url;
  const pageLocation = require('./pageInformation').location;
  const userInformation = require('./userInformation').information;

  (async () => {
    const browser = await puppeteer.launch({args: ['--no-sandbox']});
    const page = await browser.newPage();
    await page.goto(url);

    // Part 1
    await page.click(pageLocation['...']);
    await page.type(pageLocation['...'], userInformation['...']);
    await page.waitFor(pageLocation['...']);
    await page.click(pageLocation['...']);
    ... ~20 other "page.click" or "page.select"

    // Part 2
    var continueLoop = true;
    while (continueLoop) {
      var list = await page.$x(pageLocation['...']);
      if (list.length > 0) {
        await list[0].click();
        var found = true;
        var continueLoop = false;
      } else {
        var afficher = await page.$x(pageLocation['...']);
        if (afficher.length > 0) {
          await afficher[0].click();
        } else {
          var continueLoop = false;
          var found = false;
        };
      };
    };

    // Part 3
    if (found) {
      await page.waitForXPath(pageLocation['...']);
      const xxx = await page.$x(pageLocation['...']);
      await xxx[0].click();
      ... 5 other blocks with exact same 3 lines, but with other elements to click
    };

    await browser.close();
  })();
};
I have tried to run it part by part; sometimes it times out at the end of Part 1, sometimes at the end of Part 2, but the whole script has never run to completion.

Without much context about what your code does, it is hard to point out the root cause, but what I can tell you is to continue debugging your code as Horatio suggested, or to use a more sophisticated tool like Stackdriver to monitor the performance of your Cloud Functions. Evaluate its pricing if you are interested.
If Stackdriver is overkill, simply wrap sections of your routine with timing calls to find out exactly where all that time is being spent. Here is an example:
var start = process.hrtime();
yourfunction();
var diff = process.hrtime(start);
var elapsedMs = diff[0] * 1000 + diff[1] / 1e6; // [seconds, nanoseconds] -> milliseconds
console.log("Elapsed: " + elapsedMs.toFixed(3) + " ms");
Once you have found the exact piece of code that is affecting the execution, you will probably have to optimize it. Additionally, since it worked perfectly locally, consider that processes running in a cloud environment are sometimes affected by latency due to the 'proximity' of the other resources they consume.
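As a rough illustration of that wrapping applied to the function above (the timed() helper and the labels are just an assumption for this sketch, not part of the original code):

// Rough sketch (assumption): a small helper to time each awaited part of the routine.
async function timed(label, fn) {
  const start = process.hrtime();
  const result = await fn();
  const [s, ns] = process.hrtime(start);
  console.log(`${label} took ${(s * 1000 + ns / 1e6).toFixed(3)} ms`);
  return result;
}

// e.g. inside the async IIFE:
// await timed('Part 1', async () => { /* the page.click / page.type calls */ });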
Regarding your second question, about automating your function so it executes every hour: you can take advantage of Cloud Scheduler. It can make scheduled calls to HTTP/HTTPS endpoints, and an HTTP-triggered Cloud Function is exactly that. Make sure to check its pricing as well.
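One thing to watch out for with an HTTP-triggered function is to send a response once the asynchronous work has finished, so the request does not simply hang until the timeout. A minimal sketch, assuming the Puppeteer routine above is wrapped in a hypothetical runSearch() helper:

// Minimal sketch (assumption): runSearch() is a hypothetical wrapper around the
// Puppeteer routine above; the point is to await it and reply to the caller.
exports.launchSearch = async (req, res) => {
  try {
    await runSearch();
    res.status(200).send('Search completed');
  } catch (err) {
    console.error(err);
    res.status(500).send(err.message);
  }
};

Cloud Scheduler can then call the function's HTTP endpoint every hour, for example with the cron expression 0 * * * *.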

Related

How to test the performance impact of a single script on a website?

I am trying to understand the impact of an additional script being added to a webpage.
My first attempt has been to use lighthouse, and to try and calculate metrics before and after the addition of the script:
const fs = require('fs');
const playwright = require('playwright');
const lighthouse = require('lighthouse');
const args = require('minimist')(process.argv.slice(2));

const BROWSER_CONTEXT_DIR = 'browserContextDir';

async function main(inject) {
  const context = await playwright.chromium.launchPersistentContext(BROWSER_CONTEXT_DIR, {
    args: [`--remote-debugging-port=8041`],
    headless: false,
  });
  if (inject) {
    await context.addInitScript({
      path: 'script.js'
    });
  }
  const lhOptions = {
    port: 8041,
    onlyCategories: ['performance']
  };
  const result = await lighthouse(args.url, lhOptions);
  fs.writeFileSync('lhreport.json', JSON.stringify(result.lhr.audits, null, 2));
  await context.close();
  fs.rmdirSync(BROWSER_CONTEXT_DIR, { recursive: true });
  return {
    speed: result.lhr.audits['speed-index'],
    cpu: result.lhr.audits['first-cpu-idle'],
    blocking: result.lhr.audits['total-blocking-time']
  };
}
While this works, each run of lighthouse has too much variance in the results. It seems the proper way to go would be to do this in isolation, not via injecting into a live site.
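One way to dampen that variance, as a rough sketch reusing the main() function above (the helper name and run count are just illustrative), is to repeat the measurement and take the median of an audit's numericValue:

// Rough sketch (assumption): repeat the Lighthouse run and take the median of one
// audit's numericValue to smooth out run-to-run noise.
async function medianMetric(runs, inject, auditKey) {
  const values = [];
  for (let i = 0; i < runs; i++) {
    const audits = await main(inject); // { speed, cpu, blocking } from above
    values.push(audits[auditKey].numericValue);
  }
  values.sort((a, b) => a - b);
  return values[Math.floor(values.length / 2)];
}
// e.g. const blockingWithScript = await medianMetric(5, true, 'blocking');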
What approaches can I take to measure the impact on CPU, memory and site speed of a 3rd party script?
You can run a test with WebPageTest, block one third-party script at a time, and re-run the test. You can specify the script(s) to block under Advanced Settings > Block.
There is an excellent talk by Harry Roberts (csswizardry) that describes exactly what to do. In particular, the analysis with WebPageTest is described at around the 16-minute mark.

Trying to run a Cloud Function with LRO

Background
I am working on creating an autonomous Google AutoML end-to-end system. I created a Cloud Function that receives a Cloud Pub/Sub message when training starts. The Cloud Function uses the operation ID to get the operation status of the training. If the training of the model is complete (operation metadata = true), the function will send the model ID to a deployment function and send a Pub/Sub message with the model ID so the model can be used for predictions. I found a solution on SO in this post: How to programmatically get model id from google-cloud-automl with node.js client library
Problem
The issue I am coming across is the Cloud Function timeout of 10 minutes. I wrote this question on Reddit about potential solutions: https://www.reddit.com/r/googlecloud/comments/jqr213/cloud_function_to_compute_engine/ The Compute Engine solution does not seem practical for a system written mainly in a Cloud Functions environment. While trying to implement the cron job solution, I thought of the retry feature for Cloud Functions. It keeps the same event and will retry the function for up to a week. The documentation for retries is https://cloud.google.com/functions/docs/bestpractices/retries How could I make the function keep retrying until the operation status becomes true, complete the deployment and the Pub/Sub message, and then stop? My thought is to put the ending of the system in the if/else statement; I am just struggling to find documentation on this and on whether it would actually work.
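What I have in mind, as a rough sketch (this assumes "Retry on failure" is enabled on the function, and isTrainingDone() and deployAndPublish() are hypothetical placeholders for the logic below):

// Rough sketch (assumption): with retries enabled on a Pub/Sub-triggered function,
// throwing an error makes Pub/Sub redeliver the same event later, while returning
// normally acknowledges the event and stops the retries.
exports.checkTraining = async (event, context) => {
  const done = await isTrainingDone(event); // hypothetical status check
  if (!done) {
    throw new Error('Training not finished yet - retry later');
  }
  await deployAndPublish(event); // hypothetical deploy + publish step
};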
Code
const {AutoMlClient} = require('@google-cloud/automl').v1;
const {PubSub} = require('@google-cloud/pubsub');

// Instantiates the clients
const client = new AutoMlClient();
const pubSubClient = new PubSub(); // Pub/Sub client used by pubSub() below

exports.helloPubSub = (event, context) => {
  // Imports the Google Cloud AutoML library
  const message = event.data
    ? Buffer.from(event.data, 'base64').toString()
    : 'Hello, World';
  const model = message;
  console.log(model);
  const modelpath = message.replace('"', '');
  const modelID = modelpath.replace('"', '');
  const message1 = model.replace('projects/170974376642/locations/us-central1/operations/', '');
  const message2 = message1.replace('"', '');
  const message3 = message2.replace('"', '');
  console.log(`Operation ID is: ${message3}`);
  getOperationStatus(message3, modelID);
};

// [START automl_vision_classification_deploy_model_node_count]
async function getOperationStatus(opId, message) {
  console.log('Starting operation status');
  const opped = opId;
  const data = message;
  const projectId = '170974376642';
  const location = 'us-central1';
  const operationId = opId;
  // Construct request
  const request = {
    name: `${message}`,
  };
  console.log('Made it to the response');
  const [response] = await client.operationsClient.getOperation(request);
  console.log(`Name: ${response.name}`);
  console.log(`Operation details:`);
  var apple = JSON.stringify(response);
  console.log(apple);
  console.log('Loop until the model is ready to deploy');
  if (apple.includes('True')) {
    const appleF = apple.replace(/projects\/[a-zA-Z0-9-]*\/locations\/[a-zA-Z0-9-]*\/models\//, '');
    deployModelWithNodeCount(appleF);
    pubSub(appleF);
  } else {
    getOperationStatus(opped, data);
  }
}

async function pubSub(id) {
  const topicName = 'modelID';
  const data = JSON.stringify({foo: `${id}`});
  async function publishMessage() {
    // Publishes the message as a string, e.g. "Hello, world!" or JSON.stringify(someObject)
    const dataBuffer = Buffer.from(data);
    try {
      const messageId = await pubSubClient.topic(topicName).publish(dataBuffer);
      console.log(`Message ${messageId} published.`);
    } catch (error) {
      console.error(`Received error while publishing: ${error.message}`);
      process.exitCode = 1;
    }
  }
  publishMessage();
  // [END pubsub_publish_with_error_handler]
  // [END pubsub_quickstart_publisher]
  process.on('unhandledRejection', err => {
    console.error(err.message);
    process.exitCode = 1;
  });
}

async function deployModelWithNodeCount(message) {
  const projectId = 'ireda1';
  const location = 'us-central1';
  const modelId = message;
  // Construct request
  const request = {
    name: client.modelPath(projectId, location, modelId),
    imageClassificationModelDeploymentMetadata: {
      nodeCount: 1,
    },
  };
  const [operation] = await client.deployModel(request);
  // Wait for operation to complete.
  const [response] = await operation.promise();
  console.log(`Model deployment finished. ${response}`);
}
// [END automl_vision_classification_deploy_model_node_count]
There are several improvements that you can consider for your code. First of all, it is important to understand that Cloud Functions are short-lived: 9 minutes is the maximum time your function will be active. Cloud Functions are not meant for background operations; if you are looking for a solution that can be executed in the background and requires minimal infrastructure, I would recommend having a look at Cloud Run.
Now let's have a look at some parts of the code and how they can be improved with a different architecture, keeping Cloud Functions and Pub/Sub as the backbone.
Waiting on model deployment
The code you use is:
if (apple.includes('True')) {
  const appleF = apple.replace(/projects\/[a-zA-Z0-9-]*\/locations\/[a-zA-Z0-9-]*\/models\//, '');
  deployModelWithNodeCount(appleF);
  pubSub(appleF);
} else {
  getOperationStatus(opped, data);
}
First of all, I would strongly suggest not using recursion here, because a) this can be handled via a simple loop, and b) you are bombarding the service without any timeout or back-off policy. The latter might result in either your service crashing or the endpoint starting to reject your requests.
To improve your code, you can at least add a delay between the calls, like this:
setTimeout(() => getOperationStatus(opped, data), 1000)
For readability, I would also suggest simply using a loop, since you are using async patterns anyway:
let status = await getOperationStatus(opped, data);
while (!status) {
  await new Promise(t => setTimeout(t, 1000));
  status = await getOperationStatus(opped, data);
}
In this case, you need to separate it into two functions: 1) getOperationStatus, which just returns the status, and 2) waitForDeployment, which polls for the status, compares it with the expected result, and decides to a) wait and retry or b) abandon and return, as sketched below.
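A minimal sketch of that split, assuming the operation name is passed in and that the interval and attempt limit are arbitrary illustration values:

// Minimal sketch (assumption): getOperationStatus() only reports whether the
// long-running operation is done; waitForDeployment() owns the polling loop.
async function getOperationStatus(operationName) {
  const [response] = await client.operationsClient.getOperation({ name: operationName });
  return response.done === true;
}

async function waitForDeployment(operationName, intervalMs = 5000, maxAttempts = 60) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    if (await getOperationStatus(operationName)) {
      return true; // finished
    }
    await new Promise(resolve => setTimeout(resolve, intervalMs)); // wait & retry
  }
  return false; // abandon & return
}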
This might make your code better, but it does not solve the fundamental problem of the system design. To understand this, let's have a look at splitting responsibilities and structuring the system differently. As a side note, the guide here is not meant specifically for a Cloud Functions application.
A few explanations:
The Activation Function initializes the entire process; it calls Vision AutoML to start the deployment. It only gets the ID of the operation and pushes it to the queue.
Cloud Scheduler pushes a trigger to Pub/Sub (alternatively, it can also call the function as an HTTP endpoint) every X minutes/seconds, saying that it is time to check on the progress.
The Polling Function, once triggered, asks for the next ID to check, queries Cloud AutoML and, if the operation is finished, acknowledges the message and writes the results; otherwise it exits. You need to be careful with the configuration of acknowledgements here. Useful information is here
Polling of the status
A minor thing I have noticed is how you are polling the status. Why don't you just query GET https://automl.googleapis.com/v1/projects/project-id/locations/us-central1/operations/operation-id and read the done status (check here for details)?
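A rough sketch of that REST call, assuming google-auth-library is used for credentials (the helper name and parameters are illustrative):

// Rough sketch (assumption): poll the long-running operation over REST and read
// its `done` flag, using google-auth-library to obtain an authenticated client.
const {GoogleAuth} = require('google-auth-library');

async function isOperationDone(projectId, locationId, operationId) {
  const auth = new GoogleAuth({scopes: 'https://www.googleapis.com/auth/cloud-platform'});
  const authClient = await auth.getClient();
  const url = `https://automl.googleapis.com/v1/projects/${projectId}/locations/${locationId}/operations/${operationId}`;
  const res = await authClient.request({url});
  return res.data.done === true;
}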
Conclusion: Cloud Functions are short-lived and should handle only one operation at a time, with no waiting. If you want a simple loop that waits for results, use Cloud Run.

waitForSelector suddenly no longer working in puppeteer

I have a working puppeteer script that I'd like to make into an API but I'm having problems with waitForSelector.
Background:
I wrote a puppeteer script that successfully searches for and scrapes the result of a query I specify in the code e.g. let address = xyz;. Now I'd like to make it into an API so that a user can query something. I managed to code everything necessary for the local API (working with express) and everything works as well. By that I mean: I coded all the server side stuff: I can make a request, the scraper function is called, puppeteer starts up, carries out my search (I need to type in an address, choose from a dropdown and press enter).
The status:
The result of my query is a form (basically 3 columns and some rows) in an iFrame and I want to scrape all the rows (I modify them into a specific json later on). The way it works is I use waitForSelector on the form's selector and then I use frame.evaluate.
Problem:
When I run my normal scraper everything works well, but when I run the (slightly modified but essentially the same) code within the API framework, waitForSelector suddenly always times out. I have tried all the usual workarounds: waitForNavigation, taking a screenshot and inspecting, etc., but nothing helped. I've been reading quite a bit; could it be that I'm screwing something up in terms of async/await when I call my scraper from within the context of the API? I'm still quite new to this, so please bear with me. This is the code of the working script - I indicated the important part.
const puppeteer = require("puppeteer");
const chalk = require("chalk");
const fs = require('fs');

const error = chalk.bold.red;
const success = chalk.keyword("green");

let address = 'Gumpendorfer Straße 12, 1060 Wien';

(async () => {
  try {
    // open the headless browser
    var browser = await puppeteer.launch();
    // open a new page
    var page = await browser.newPage();
    // enter url in page
    await page.goto(`https://mein.wien.gv.at/Meine-Amtswege/richtwert?subpage=/lagezuschlag/`, {waitUntil: 'networkidle2'});
    // continue without newsletter
    await page.click('#dss-modal-firstvisit-form > button.btn.btn-block.btn-light');
    // let everything load
    await page.waitFor(1000);
    console.log('waiting for iframe with form to be ready.');
    // wait until selector is available
    await page.waitForSelector('iframe');
    console.log('iframe is ready. Loading iframe content');
    // choose the relevant iframe
    const elementHandle = await page.$(
      'iframe[src="/richtwertfrontend/lagezuschlag/"]',
    );
    // go into frame in order to input info
    const frame = await elementHandle.contentFrame();
    // enter address
    console.log('filling form in iframe');
    await frame.type('#input_adresse', address, { delay: 100});
    // choose first option from dropdown
    console.log('Choosing from dropdown');
    await frame.click('#react-autowhatever-1--item-0');
    console.log('pressing button');
    // press button to search
    await frame.click('#next-button');
    // scraping data
    console.log('scraping');
    await frame.waitForSelector('#summary > div > div > br ~ div'); // This keeps failing in the API
    const res = await frame.evaluate(() => {
      const rows = [...document.querySelectorAll('#summary > div > div > br ~ div')];
      const cells = rows.map(
        row => [...row.querySelectorAll('div')]
          .map(cell => cell.innerText)
      );
      return cells;
    });
    await browser.close();
    console.log(success("Browser Closed"));
    const mapFields = (arr1, arr2) => {
      const mappedArray = arr2.map((el) => {
        const mappedArrayEl = {};
        el.forEach((value, i) => {
          if (arr1.length < (i + 1)) return;
          mappedArrayEl[arr1[i]] = value;
        });
        return mappedArrayEl;
      });
      return mappedArray;
    };
    const Arr1 = res[0];
    const Arr2 = res.slice(1, 3);
    let dataObj = {};
    dataObj[address] = [];
    // dataObj['lagezuschlag'] = mapFields(Arr1, Arr2);
    // dataObj['adresse'] = address;
    dataObj[address] = mapFields(Arr1, Arr2);
    console.log(dataObj);
  } catch (err) {
    // Catch and display errors
    console.log(error(err));
    await browser.close();
    console.log(error("Browser Closed"));
  }
})();
I just can't understand why it would work in the one case and not in the other, even though I barely changed something. For the API I basically changed the name of the async function to const search = async (address) => { such that I can call it with the query in my server side script.
Thanks in advance - I'm not attaching the API code because I don't want to clutter the question. I can update it if necessary.
I solved this myself. It turns out the problem wasn't as complicated as I thought, and it was annoyingly simple to solve. The problem wasn't with the selector that was timing out but with the previous selectors, specifically the typing and dropdown selectors. Essentially, things were going too fast: before the search query was fully typed in, the dropdown was already clicked and nonsense came out. How I solved it: I included a waitFor(1000) call before the dropdown is selected, and everything worked perfectly. An interesting realisation was that even though that one selector timed out, it wasn't actually the source of the problem. But like I said, annoyingly simple, and I feel dumb for asking this :) but maybe someone will see this and learn from my mistake.
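A slightly more robust alternative to the fixed delay, as a sketch reusing the selectors from the script above, is to wait for the dropdown suggestion to actually exist before clicking it:

// Sketch (assumption): wait for the suggestion element instead of a fixed delay.
await frame.type('#input_adresse', address, { delay: 100 });
await frame.waitForSelector('#react-autowhatever-1--item-0', { visible: true });
await frame.click('#react-autowhatever-1--item-0');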

Get all messages from AWS SQS in NodeJS

I have the following function that gets a message from AWS SQS. The problem is that I get one message at a time and I wish to get all of them, because I need to check the ID of each message:
function getSQSMessages() {
  const params = {
    QueueUrl: 'some url',
  };
  sqs.receiveMessage(params, (err, data) => {
    if (err) {
      console.log(err, err.stack);
      return(err);
    }
    return data.Messages;
  });
};

function sendMessagesBack() {
  return new Promise((resolve, reject) => {
    if (Array.isArray(getSQSMessages())) {
      resolve(getSQSMessages());
    } else {
      reject(getSQSMessages());
    };
  });
};
The function sendMessagesBack() is used in another async/await function.
I am not sure how to get all of the messages. While looking into how to do it, I saw people mention loops, but I could not figure out how to implement that in my case.
I assume I have to put sqs.receiveMessage() in a loop, but then I get confused about what I need to check and when to stop the loop so I can get the ID of each message.
If anyone has any tips, please share.
Thank you.
I suggest you use the Promise API; it lets you use async/await syntax right away.
const { Messages } = await sqs.receiveMessage(params).promise();
// Messages will contain all your needed info
await sqs.sendMessage(params).promise();
In this way, you will not need to wrap the callback API with Promises.
SQS doesn't return more than 10 messages in the response. To get all the available messages, you need to call the getSQSMessages function recursively.
If you return a promise from getSQSMessages, you can do something like this.
getSQSMessages()
  .then(data => {
    if (!data.Messages || data.Messages.length === 0) {
      // no messages are available. return
    }
    // continue processing each message, or push the messages into an array
    // and call the getSQSMessages function again.
  });
You can never be guaranteed to get all the messages in a queue, unless after you get some of them, you delete them from the queue - thus ensuring that the next request returns a different selection of records.
Each request will return up to 10 messages. If you don't delete them, there is a good chance that the next request for up to 10 messages will return a mix of messages you have already seen and some new ones - so you will never really know when you have seen them all.
It may be that a queue is not the right tool for your use case - but since I don't know your use case, it's hard to say.
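As a rough sketch of that receive-and-delete pattern (assuming the AWS SDK v2 promise API, an existing sqs client, and that deleting the messages is acceptable in your use case):

// Rough sketch (assumption): drain the queue by deleting each batch after
// processing it, so the next receive call returns different messages.
async function drainQueue(sqs, queueUrl) {
  while (true) {
    const { Messages } = await sqs.receiveMessage({
      QueueUrl: queueUrl,
      MaxNumberOfMessages: 10,
      WaitTimeSeconds: 5, // long polling reduces empty responses
    }).promise();
    if (!Messages || Messages.length === 0) break;
    for (const message of Messages) {
      console.log(message.MessageId); // check the ID / process the message here
    }
    await sqs.deleteMessageBatch({
      QueueUrl: queueUrl,
      Entries: Messages.map(m => ({ Id: m.MessageId, ReceiptHandle: m.ReceiptHandle })),
    }).promise();
  }
}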
I know this is a bit of a necro, but I landed here last night while trying to pull all messages from a dead-letter queue in SQS. While the accepted answer, that you cannot guarantee to get all messages from the queue, is absolutely correct, I did want to drop an answer for anyone who lands here as well and needs to get around the 10-message limit per request from AWS.
Dependencies
In my case I have a few dependencies already in my project that I used to make life simpler.
lodash - This is something we use in our code to help make things functional. I don't think I used it below, but I'm including it since it's in the file.
cli-progress - This gives you a nice little progress bar on your CLI.
Disclaimer
The below was thrown together while troubleshooting some production errors integrating with another system. Our DLQ messages contain some identifiers that I need in order to formulate CloudWatch queries for troubleshooting. Given that these are two different GUIs in AWS, switching back and forth is cumbersome, especially since our AWS sessions are via a form of federation and a session only lasts for one hour max.
The script
#!/usr/bin/env node
const _ = require('lodash');
const aswSdk = require('aws-sdk');
const cliProgress = require('cli-progress');

const queueUrl = 'https://[put-your-url-here]';
const queueRegion = 'us-west-1';

const getMessages = async (sqs) => {
  const resp = await sqs.receiveMessage({
    QueueUrl: queueUrl,
    MaxNumberOfMessages: 10,
  }).promise();
  return resp.Messages;
};

const main = async () => {
  const sqs = new aswSdk.SQS({ region: queueRegion });

  // First thing we need to do is get the current number of messages in the DLQ.
  const attributes = await sqs.getQueueAttributes({
    QueueUrl: queueUrl,
    AttributeNames: ['All'], // Probably could thin this down but it's late
  }).promise();
  const numberOfMessage = Number(attributes.Attributes.ApproximateNumberOfMessages);

  // Next we create an in-memory cache for the messages.
  const allMessages = {};
  let running = true;

  // Honesty here: the examples we have in existing code use the multi-bar. It was
  // about 10 PM and I had 28 DLQ messages I was looking into. I didn't feel it was
  // worth converting the multi-bar to a single-bar. Look into the docs on the
  // GitHub page if this is really a sticking point for you.
  const progress = new cliProgress.MultiBar({
    format: ' {bar} | {name} | {value}/{total}',
    hideCursor: true,
    clearOnComplete: true,
    stopOnComplete: true
  }, cliProgress.Presets.shades_grey);
  const progressBar = progress.create(numberOfMessage, 0, { name: 'Messages' });

  // TODO: put in a time limit to avoid an infinite loop.
  // NOTE: For 28 messages I managed to get them all with this approach in about
  // 15 seconds. When/if I clean up this script I plan to add the time-based
  // short-circuit at that point.
  while (running) {
    // Fetch all the messages we can from the queue. The number of messages is
    // not guaranteed per the AWS documentation.
    let messages = await getMessages(sqs);
    for (let i = 0; i < messages.length; i++) {
      // Loop through the received messages and only copy messages we have not already cached.
      let message = messages[i];
      let data = allMessages[message.MessageId];
      if (data === undefined) {
        allMessages[message.MessageId] = message;
      }
    }
    // Update our progress bar with the current progress.
    const discoveredMessageCount = Object.keys(allMessages).length;
    progressBar.update(discoveredMessageCount);
    // Give a quick pause just to make sure we don't get rate limited or something.
    await new Promise((resolve) => setTimeout(resolve, 1000));
    running = discoveredMessageCount !== numberOfMessage;
  }

  // Now that we have all the messages, I printed them to the console so I could
  // copy/paste the output into LibreCalc (an Excel-like tool). I split rows on the
  // semicolon out of habit since sometimes similar scripts deal with data that has
  // commas in it.
  const keys = Object.keys(allMessages);
  console.log('Message ID;ID');
  for (let i = 0; i < keys.length; i++) {
    const message = allMessages[keys[i]];
    const decodedBody = JSON.parse(message.Body);
    console.log(`${message.MessageId};${decodedBody.id}`);
  }
};

main();

Nodejs/Puppeteer - Navigation timeout

I need help to understand how the timeout works, especially with Node/Puppeteer.
I have read all the Stack Overflow questions and GitHub issues about this, but I can't figure out what is wrong.
Probably my code...
When I run this file, I receive the error from the image. You can see the ways I tried to fix it; nothing works.
Can someone explain why this happens and the best approach to avoid it? Is there a better way to get these projects?
// go to the seeds every X amount of time
var https = require('https');
var Q = require('q');
var fs = require('fs');
var puppeteer = require('puppeteer');
var Projeto = require('./Projeto.js');

const url = 'https://www.99freelas.com.br/projects?categoria=web-e-desenvolvimento';

/* const idToScrape;
   should receive the URL and the specific parameters of each seed */

async function genScraper() {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  //page.setDefaultNavigationTimeout(60000);
  page.waitForNavigation( { timeout: 60000, waitUntil: 'domcontentloaded' });
  await page.goto(url);

  var projetos = await page.evaluate(() => {
    let qtProjs = document.querySelectorAll('.result-list li').length;
    let listaDeProjs = Array.from(document.querySelectorAll('.result-list li'));
    let tempProjetos = [];
    for (var i = 0; i <= listaDeProjs.length; i++) {
      let titulo = listaDeProjs[i].children[1].children[0].textContent;
      let descricao = listaDeProjs[i].children[2].textContent;
      let habilidades = listaDeProjs[i].children[3].textContent;
      let publicado = listaDeProjs[i].children[1].children[1].children[0].textContent;
      let tempoRestante = listaDeProjs[i].children[1].children[1].children[1].textContent;
      //let infoCliente;
      proj = new Projeto(titulo, descricao, habilidades, publicado, tempoRestante);
      tempProjetos.push(proj);
    }
    return tempProjetos;
  });
  console.log(projetos);
  browser.close();
}

genScraper();
I recommend avoiding the waitForNavigation method before the goto call.
Basically, it would be better to use the goto method with the default timeout, which is 30000 ms. In my opinion, if the website takes more than 30 seconds to load or respond, something is probably wrong.
Instead, I would do something like this:
await page.goto(url, {
  waitUntil: 'networkidle0'
});
Depending on the version of Puppeteer that you're using, you will get different behaviours. I am using version 1.4.0 and it is working well so far.
The documentation states the following:
page.goto will throw an error if:
there's an SSL error (e.g. in case of self-signed certificates).
target URL is invalid.
the timeout is exceeded during navigation.
the main resource failed to load.
So, check that none of the previous scenarios is happening.
Also, you can curl the URL from your terminal to see if it responds to outside calls; cross-origin problems are common too.
Honestly, there is no way to say for sure what is triggering your timeout, but that checklist should help. I had a timeout problem recently and the cause was my server configuration, so I also suggest checking whether the machine on which you are running this code has enough memory.
In your for loop,
for( var i=0; i<=listaDeProjs.length; i++ ) {
...
}
the <= should be <; otherwise the last iteration reads one element past the end of listaDeProjs.
Your evaluation script will also fail in several places if anything along this path is undefined (e.g., if children[1] or children[0] is undefined):
listaDeProjs[i].children[1].children[0].textContent;
You can do the following with lodash:
_.get(listaDeProjs[i],"children[1].children[0].textContent","")
That will default to "" if there is no such value.
Additionally, the following works perfectly fine with your code in Puppeteer 1.7 via https://try-puppeteer.appspot.com/
await page.goto(url, {
  waitUntil: 'networkidle2',
  timeout: 5000
});
