I am working on an AWS Lambda function that requires a Puppeteer browser to be launched for each new S3 object in a bucket. The browser launch code was taking a very long time on the initial invocation, so I thought I would put the launch code outside the handler and use Provisioned Concurrency to have the browser ready to go when a new file is inserted into the bucket.
It does seem to call the promise, because before any actual invocations are made I'm getting logs saying "Getting executable path" from the provisioned concurrency instances. However, it never outputs the message "Launching browser" until an actual invocation of the lambda is made. Why would the promise chromium.executablePath not complete until an invocation is made if it is outside the handler?
let startTime = Date.now();
const chromium = require("chrome-aws-lambda");
const AWS = require("aws-sdk");
const s3 = new AWS.S3();
const { createSSRApp } = require("vue");
const { renderToString } = require("vue/server-renderer");
const path = require("path");
const fs = require("fs");
const manifest = require("../../compiled/ssr-manifest.json");
console.log("Load packages: " + (Date.now() - startTime));
const browserPromise = new Promise((res) => {
const browserStartTime = Date.now();
console.log("Getting executable path");
chromium.executablePath.then((executablePath) => {
console.log("Launching browser");
chromium.puppeteer
.launch({
args: chromium.args,
defaultViewport: chromium.defaultViewport,
executablePath: executablePath,
headless: true,
})
.then((browser) => {
res(browser);
console.log("Start headless browser: " + (Date.now() - browserStartTime));
});
});
});
browserPromise.then(() => console.log("Started Headless Browser"));
/**
* A Lambda function that logs the payload received from S3.
*/
exports.handler = async (event, context) => {
const bucketName = event.Records[0].s3.bucket.name;
const objectKey = event.Records[0].s3.object.key;
const browser = await browserPromise;
... //use browser code
}
If I require this file locally in another node file it runs the promise fine without calling the handler function, so it must be some lambda environment specific thing I'm not understanding. Does anyone have any insight into this? Thanks in advance.
The issue you describe results from poor control of the async code.
Delay instantiation of browser promise
When you instantiate a Promise using the new keyword, execution of the function you provide starts immediately.
const x = new Promise(res => console.log('test'))
This will print 'test' immediately without needing a .then or await. This is why your code prints 'Getting executable path' right away, vs. waiting for a request event from lambda.
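A minimal sketch of that ordering, using only plain Node.js. The log array is just an illustrative helper, not part of the original code:

```javascript
// Record the order in which things happen.
const log = [];

log.push("before new Promise");
const p = new Promise((resolve) => {
  // The executor runs synchronously, while the promise is being constructed.
  log.push("inside executor");
  resolve("done");
});
log.push("after new Promise");

// .then callbacks never run synchronously; they run later as microtasks.
p.then((value) => {
  log.push("then: " + value);
  console.log(log.join(" -> "));
});
log.push("end of script");
```

The executor entry lands between "before" and "after", while the .then entry only appears once the rest of the script has finished.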
To solve this, don't instantiate this promise until a request actually happens. Move your promise construction to a function that you can call from the handler when a request occurs.
async function startBrowser () {
// code to start browser
return browser
}
exports.handler = async (event, context) => {
const bucketName = event.Records[0].s3.bucket.name;
const objectKey = event.Records[0].s3.object.key;
const browser = await startBrowser();
// use browser
}
Fixing async return flow
Secondly, you need to make your startBrowser function actually return a browser. Because you haven't awaited any of the promises created inside your browserPromise, it will trigger the code to start chromium but resolve immediately. It will take some time for the browser to start, which is why you don't see 'Launching browser' until much later.
To fix this, make sure your browser promise doesn't resolve until the browser is ready, and then return the browser object so it can be used.
async function startBrowser () {
const browserStartTime = Date.now();
console.log("Getting executable path");
// await the whole chain so we only continue once the browser is ready
const browser = await chromium.executablePath.then((executablePath) => {
console.log("Launching browser");
// return the launch promise so the chain resolves to the browser
return chromium.puppeteer
.launch({
args: chromium.args,
defaultViewport: chromium.defaultViewport,
executablePath: executablePath,
headless: true,
});
});
console.log("Start headless browser: " + (Date.now() - browserStartTime));
return browser;
}
exports.handler = async (event, context) => {
const bucketName = event.Records[0].s3.bucket.name;
const objectKey = event.Records[0].s3.object.key;
const browser = await startBrowser();
// use browser
}
Improving performance by sharing browser across requests
You can make further performance improvements by saving the browser in a singleton so that a new one doesn't need to be instantiated every request cycle.
let browser; // singleton/global browser object
// start a browser to be used for all requests
// this will take a little time, so hold a reference to the promise so we
// can know when it is ready to use
const browserPromise = startBrowser().then((newBrowser) => { browser = newBrowser; });
exports.handler = async (event, context) => {
const bucketName = event.Records[0].s3.bucket.name;
const objectKey = event.Records[0].s3.object.key;
// if the first request comes before the browser is ready, we should
// wait for the promise to resolve
if (!browser) await browserPromise
// use browser
}
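A variation on the same idea, as a rough sketch: cache the promise itself rather than the browser, so concurrent requests share one launch. Here fakeLaunch, launchCount and getBrowser are hypothetical names used for illustration; fakeLaunch merely stands in for the real chromium.puppeteer.launch() call.

```javascript
// Hypothetical stand-in for chromium.puppeteer.launch(); it only counts calls.
let launchCount = 0;
function fakeLaunch() {
  launchCount += 1;
  return new Promise((resolve) =>
    setTimeout(() => resolve({ id: launchCount }), 10)
  );
}

let browserPromise = null; // cached promise, shared by all invocations

function getBrowser() {
  // Start the launch only once; later callers reuse the same promise.
  if (!browserPromise) {
    browserPromise = fakeLaunch().catch((err) => {
      browserPromise = null; // allow a retry after a failed launch
      throw err;
    });
  }
  return browserPromise;
}

async function handler(event) {
  const browser = await getBrowser();
  // use browser
  return browser.id;
}
```

Because every caller awaits the same promise, there is no need for a separate "is it ready yet" check, and a failed launch does not stay cached forever.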
Notes about memory usage
Browsers can use a lot of memory and can also leak memory. If fixing your async code still does not make the lambda work, consider that loading a webpage can consume gigabytes of memory especially for large and complex pages (Google Maps for example). I'm not experienced with lambda but you may find yourself running into memory limits.
Related
I need to upload a v8 heap dump into an AWS S3 bucket after it's generated; however, the file that is uploaded is either 0KB or 256KB. The file on the server is over 70MB in size, so it appears that the request isn't waiting until the heap dump is completely flushed to disk. I'm guessing the readable stream that is getting piped into fs.createWriteStream is happening in an async manner and the await with the call to the function isn't actually waiting. I'm using the v3 version of the AWS NodeJS SDK. What am I doing incorrectly?
Code
async function createHeapSnapshot (fileName) {
const snapshotStream = v8.getHeapSnapshot();
// It's important that the filename end with `.heapsnapshot`,
// otherwise Chrome DevTools won't open it.
const fileStream = fs.createWriteStream(fileName);
snapshotStream.pipe(fileStream);
}
async function pushHeapSnapshotToS3(fileName)
{
const heapDump = fs.createReadStream(fileName);
const s3Client = new S3Client();
const putCommand = new PutObjectCommand(
{
Bucket: "my-bucket",
Key: `heapdumps/${fileName}`,
Body: heapDump
}
)
return s3Client.send(putCommand);
}
app.get('/heapdump', asyncMiddleware(async (req, res) => {
const currentDateTime = Date.now();
const fileName = `${currentDateTime}.heapsnapshot`;
await createHeapSnapshot(fileName);
await pushHeapSnapshotToS3(fileName);
res.send({
heapdumpFileName: `${currentDateTime}.heapsnapshot`
});
}));
Your guess is correct. The createHeapSnapshot() returns a promise, but that promise has NO connection at all to when the stream is done. Therefore, when the caller uses await on that promise, the promise is resolved long before the stream is actually done. async functions have no magic in them to somehow know when a non-promisified asynchronous operation like .pipe() is done. So, your async function returns a promise that has no connection at all to the stream functions.
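To illustrate the difference, here is a small sketch in which a timer stands in for the stream. The function names and the events array are made up for the example:

```javascript
const events = [];

// Declared async, but it never awaits or returns the pending work.
async function startWork() {
  setTimeout(() => events.push("work finished"), 20);
}

// Same work, wrapped in a promise that resolves when the timer fires.
function startWorkAndWait() {
  return new Promise((resolve) => {
    setTimeout(() => {
      events.push("work finished");
      resolve();
    }, 20);
  });
}

async function main() {
  await startWork();
  events.push("first await returned"); // runs before the timer fires
  await startWorkAndWait();
  events.push("second await returned"); // runs after its timer fires
  return events;
}
```

The first await returns before its work has finished; the second one only returns afterwards, because the promise is tied to the completion of the work.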
Since streams don't have very much native support for promises, you can manually promisify the completion and errors of the streams:
function createHeapSnapshot (fileName) {
return new Promise((resolve, reject) => {
const snapshotStream = v8.getHeapSnapshot();
// It's important that the filename end with `.heapsnapshot`,
// otherwise Chrome DevTools won't open it.
const fileStream = fs.createWriteStream(fileName);
fileStream.on('error', reject).on('finish', resolve);
snapshotStream.on('error', reject);
snapshotStream.pipe(fileStream);
});
}
Alternatively, you could use the newer pipeline() function, which replaces .pipe() and has built-in error monitoring. Its promise-returning version (built-in promise support was added in nodejs v15) rejects the promise if either stream fails:
const { pipeline } = require('stream/promises');
function createHeapSnapshot (fileName) {
const snapshotStream = v8.getHeapSnapshot();
// It's important that the filename end with `.heapsnapshot`,
// otherwise Chrome DevTools won't open it.
return pipeline(snapshotStream, fs.createWriteStream(fileName))
}
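As a quick way to convince yourself that the promise from pipeline() only resolves once the file is complete, here is a self-contained sketch. It assumes a small in-memory stream in place of v8.getHeapSnapshot() and a temporary file name; writeStreamToFile and demo are illustrative names only.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");
const { Readable } = require("stream");
const { pipeline } = require("stream/promises");

// Writes a stream to disk; the promise resolves after the write stream finishes.
function writeStreamToFile(source, fileName) {
  return pipeline(source, fs.createWriteStream(fileName));
}

async function demo() {
  // Hypothetical stand-in for v8.getHeapSnapshot(): two 1 KiB chunks.
  const source = Readable.from([
    Buffer.alloc(1024, "a"),
    Buffer.alloc(1024, "b"),
  ]);
  const fileName = path.join(os.tmpdir(), `demo-${process.pid}.heapsnapshot`);
  await writeStreamToFile(source, fileName);
  // At this point the full content is on disk, so reading the size is safe.
  const { size } = fs.statSync(fileName);
  fs.unlinkSync(fileName); // clean up the temporary file
  return size;
}
```

In the original route handler, the upload step would take the place of the statSync call, after the await.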
I am using Puppeteer to do some webscraping which is executed on a scheduled pubsub Cloud Function. The issue that I have is that the page.goto() and page.waitForSelector() never ever completes when I deploy my function onto Firebase Cloud Function. The script works fine locally on my machine.
Here is my implementation so far:
//Scheduled pubsub function at ./functions/index.js
exports.scraper = functions.pubsub
.onRun((context) => {
var scraper = new ScraperManager();
return scraper.start();
})
//Entry function
ScraperManager.prototype.start = async function() {
var webpagePromises = []
for (const agency of agencies) {
for (page_num = 0; page_num < num_of_pages; page_num++) {
const url = setupUrl(agency, page_num); //Returns a url
const webpagePromise = getWebpage(agency, page_num, url)
webpagePromises.push(webpagePromise)
}
}
return Promise.all(webpagePromises)
}
async function getWebpage(agency, page_num, url) {
var data = {}
const browser = await puppeteer.launch(constants.puppeteerOptions);
try {
const page = await browser.newPage();
await page.setUserAgent(constants.userAgent);
await page.goto(url, {timeout: 0});
console.log("goto completes")
await page.waitForSelector('div.main_container', {timeout: 0});
console.log("waitFor completes")
const body = await page.$eval('body', el => el.innerHTML);
data['html'] = body;
return data;
} catch (err) {
console.log("Puppeteer error", err);
return;
} finally {
await browser.close();
}
}
//PuppeteerOptions in constants file
puppeteerOptions: {
headerless: true,
args: [
'--disable-gpu',
'--no-sandbox',
]
}
Note that {timeout: 0} is necessary because page.goto() and page.waitForSelector() take longer than the default timeout of 30000 ms. Even with the timeout disabled, neither goto() nor waitForSelector() ever completes, i.e. neither console.log() statement is logged. The script works fine locally, where both console.log() calls print correctly, but it never works when deployed on Cloud Functions. The entire cloud function always times out (presently set at 300 s) without printing any logs.
Would anybody be able to advise?
I had the same kind of issue: a stupid page.$eval() to get a simple node would never return (actually it was taking more than 3 minutes ...) or crash the page after more than 5 minutes on a virtual private server, while the same script was doing fine on my local computer.
Looking at the virtual server's RAM usage (around 99%), I've come to the conclusion that this kind of script cannot run on a server with only 2 GB of RAM.
:-(
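If memory really is the bottleneck, one thing that might help (I have not tried it on Cloud Functions) is to avoid starting every page load at once, since the question's code launches one browser per URL in parallel. A rough sketch of a generic limiter in plain JavaScript follows; runWithLimit is a made-up helper, not a Puppeteer or Firebase API:

```javascript
// Runs async task functions with at most `limit` of them active at a time.
// Results are returned in the same order as the tasks.
async function runWithLimit(tasks, limit) {
  const results = new Array(tasks.length);
  let next = 0; // index of the next task that has not been started yet

  async function worker() {
    // Each worker keeps pulling the next unstarted task until none are left.
    while (next < tasks.length) {
      const index = next++;
      results[index] = await tasks[index]();
    }
  }

  const workerCount = Math.min(limit, tasks.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

The tasks would be functions such as () => getWebpage(agency, page_num, url), so that nothing starts until a worker picks it up.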
I am trying to call an async function in app.js to initialize parameters (fetch configuration from AWS, connect to DB, etc.)
This is the init.js that does the initialization:
const fs = require('fs');
var params_path = process.argv.slice(2);
const paramdata = JSON.parse(fs.readFileSync(params_path[0]));
module.exports.initApp = async function()
{
// Retrieve the configuration parameters from AWS
const aws = await require('./util/configparams').AWS(paramdata);
}
This is configparams.js:
async function AWS(params)
{
awscredentials = new AWSCredentials(params);
await awscredentials.initWalletParams();
await awscredentials.initDbParams();
return awscredentials;
}
initWalletParams and initDbParams are async functions that have await statements.
Then this is my app.js where I call the init:
(async() => {
init = require('./init');
await init.initApp();
})();
The execution reaches initApp() and goes to await awscredentials.initWalletParams(); function but it does not wait there.
When I run init.js by itself the execution happens sequentially as expected but calling in app.js await is not waiting as expected.
Could someone please help figure this out?
Thank you.
I am trying to scrape https://www.premierleague.com/clubs/38/Wolverhampton-Wanderers/stats?se=274
The results being returned are for the page minus the ?se=274
This is applied by using the filter dropdown on the page and selecting 2019/20 season. I can navigate directly to the page and it works fine, but through code it does not work.
I have tried in cheerio and puppeteer. I was going to try nightmare too but this seems overkill I think. I am clearly not an expert! ;)
function getStats(callback){
var url = "https://www.premierleague.com/clubs/38/Wolverhampton-Wanderers/stats?se=274";
request(url, function (error, response, html) {
//console.log(html);
var $ = cheerio.load(html);
if(!error){
$('.allStatContainer.statontarget_scoring_att').filter(function(){
var data = $(this);
var vSOT = data.text();
//console.log(data);
console.log(vSOT);
});
}
});
callback;
}
This will return 564 instead of 2
It seems like you're calling callback before request returns. Move the callback call into the internal block, where the task you need is completed (in your case, it looks like the filter block).
It also looks like you're missing the () on the callback call.
Also, a recommendation: return the value you need through the callback.
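Roughly, the shape would be something like the sketch below. To keep it self-contained, fakeRequest is a hypothetical stand-in for request(), and the URL, the HTML and the regular expression are placeholders rather than the real page or the cheerio selector:

```javascript
// Hypothetical stand-in for request(): calls back asynchronously with HTML.
function fakeRequest(url, done) {
  setTimeout(
    () => done(null, { statusCode: 200 }, "<span class='stat'>2</span>"),
    10
  );
}

function getStats(callback) {
  fakeRequest("https://example.com/stats", (error, response, html) => {
    if (error) {
      callback(error); // report failures through the callback too
      return;
    }
    // Extract the value here, once the response has actually arrived.
    const match = /class='stat'>(\d+)</.exec(html);
    // Call the callback (with parentheses) and hand the value back.
    callback(null, match ? Number(match[1]) : null);
  });
}
```

A caller would then use it as getStats((err, value) => { ... }) and receive the value only after the request has completed.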
So this code works....$10 from a rent-a-coder did the trick. Easy when you know how!
const puppeteer = require('puppeteer');
(async () => {
const browser = await puppeteer.launch()
const page = await browser.newPage()
await page.goto('https://www.premierleague.com/clubs/4/Chelsea/stats?se=274')
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms))
await sleep(4000)
const element = await page.$(".allStatContainer.statontarget_scoring_att");
const text = await page.evaluate(element => element.textContent, element);
console.log("Shots on Target:"+text)
browser.close()
})()
I have a very simple Puppeteer script that uses exposeFunction() to run something inside headless Chrome.
(async function(){
var log = console.log.bind(console),
puppeteer = require('puppeteer');
const browser = await puppeteer.launch();
const page = await browser.newPage();
var functionToInject = function(){
return window.navigator.appName;
}
await page.exposeFunction('functionToInject', functionToInject);
var data = await page.evaluate(async function(){
console.log('woo I run inside a browser')
return await functionToInject();
});
console.log(data);
await browser.close();
})()
This fails with:
ReferenceError: window is not defined
Which refers to the injected function. How can I access window inside the headless Chrome?
I know I can do evaluate() instead, but this doesn't work with a function I pass dynamically:
(async function(){
var log = console.log.bind(console),
puppeteer = require('puppeteer');
const browser = await puppeteer.launch();
const page = await browser.newPage();
var data = await page.evaluate(async function(){
console.log('woo I run inside a browser')
return window.navigator.appName;
});
console.log(data);
await browser.close();
})()
evaluate the function
You can pass the dynamic script using evaluate.
(async function(){
var puppeteer = require('puppeteer');
const browser = await puppeteer.launch();
const page = await browser.newPage();
var functionToInject = function(){
return window.navigator.appName;
}
var data = await page.evaluate(functionToInject); // <-- Just pass the function
console.log(data); // outputs: Netscape
await browser.close();
})()
addScriptTag and readFileSync
You can save the function to a separate file and load it with addScriptTag.
await page.addScriptTag({path: 'my-script.js'});
or evaluate with readFileSync.
await page.evaluate(fs.readFileSync(filePath, 'utf8'));
or pass a parameterized function, built from a string with new Function, to page.evaluate.
await page.evaluate(new Function('foo', 'console.log(foo);'), {foo: 'bar'});
Make a new function dynamically
How about making it into a runnable function :D ?
function runnable(fn) {
return new Function("arguments", `return ${fn.toString()}(arguments)`);
}
The above will create a new function with provided arguments. We can pass any function we want.
Such as the following function with window, along with arguments,
function functionToInject() {
return window.location.href;
};
works flawlessly with promises too,
function functionToInject() {
return new Promise((resolve, reject) => {
setTimeout(() => {
resolve(window.location.href);
}, 5000);
});
}
and with arguments,
async function functionToInject(someargs) {
return someargs; // {bar: 'foo'}
};
Call the desired function with evaluate,
var data = await page.evaluate(runnable(functionToInject), {bar: "foo"});
console.log(data); // shows the location
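The round trip through fn.toString() can be tried in plain Node without Puppeteer. The sketch below uses a slightly different helper, named rebuild here to keep it apart from runnable above, and addBar is just an example function:

```javascript
// Rebuilds a function from its source text, similar in spirit to runnable().
function rebuild(fn) {
  // "args" is the single parameter of the generated wrapper function.
  return new Function("args", `return (${fn.toString()})(args)`);
}

// Example function; it only uses its parameter, no outer variables.
function addBar(someargs) {
  return { ...someargs, bar: "foo" };
}

const rebuilt = rebuild(addBar);
console.log(rebuilt({ a: 1 }));
```

Keep in mind that a function created with new Function does not capture the scope it was created in, so the original function can only rely on its arguments and on globals (such as window when it runs in the page).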
exposeFunction() isn't the right tool for this job.
From the Puppeteer docs
page.exposeFunction(name, puppeteerFunction)
puppeteerFunction Callback function which will be called in Puppeteer's context.
'In puppeteer's context' is a little vague, but check out the docs for evaluate():
page.evaluate(pageFunction, ...args)
pageFunction Function to be evaluated in the page context
exposeFunction() doesn't expose a function to run inside the page; it exposes a function that runs in node and can be called from the page.
I have to use evaluate():
Your problem could be related to the fact that page.exposeFunction() makes your function return a Promise (requiring the use of async and await). This happens because your function does not run inside the browser but inside your nodejs application, and its results are sent back to the browser code. This is why the function passed to page.exposeFunction() now returns a promise instead of the actual result. It also explains why window is not defined: your function runs inside nodejs (not the browser), and nodejs has no window object.
Related questions:
exposeFunction() does not work after goto()
exposed function queryseldtcor not working in puppeteer
How to use evaluateOnNewDocument and exposeFunction?
exposeFunction remains in memory?
Puppeteer: pass variable in .evaluate()
Puppeteer evaluate function
allow to pass a parameterized funciton as a string to page.evaluate
Functions bound with page.exposeFunction() produce unhandled promise rejections
How to pass a function in Puppeteers .evaluate() method?
How can I dynamically inject functions to evaluate using Puppeteer?