How to reduce Puppeteer size - node.js

I'm using Puppeteer for web scraping, with a small Node.js web app that I made. This web app is hosted on Heroku and uses jontewks/puppeteer-heroku-buildpack to work.
The problem I'm facing is that my app no longer builds because of the Heroku size limit:
Compiled slug size: 537.4M is too large (max is 500M).
I've tried several things:
Using Firefox instead of Chromium
It's a "no go" for me because of a current issue with puppeteer/firefox:
Reducing the size of Chromium by removing the file interactive_ui_tests.exe
I can't do this because Heroku uses Linux instead of Windows, and this file does not exist in the Linux Chromium distribution
Using headless_shell instead of Chromium
I'm stuck with this (like here) as I do not understand how to make it work. I found the file to use here, but I'm facing the same issue as the comment from 07/09/2018
Using Playwright instead of Puppeteer
It might be a solution, but I'm using packages like puppeteer-extra and puppeteer-extra-plugin-stealth, so I'd rather not switch
Reducing the size of Chromium by removing the folder locales
It helps a bit, but not much
Using an older version of Puppeteer (2.1.1), which uses an older Chromium version that is slightly lighter
At the moment, it's the only working solution that I have
Using the commands heroku repo:gc -a myapp and heroku builds:cache:purge -a myapp
My last three points reduced my slug size to 490M. So my app works, but it's not great for the (near) future, e.g. keeping Puppeteer up to date.
So here I am, asking for help, as I do not have any more ideas at the moment.
Thank you very much for your help 🙏

Finally, I ended up using Playwright.
With this buildpack, the build of my app is only 250 MB!
Here are the few steps I followed:
Install playwright-chromium with npm to download only Chromium.
Set the PLAYWRIGHT_BUILDPACK_BROWSERS env variable to chromium in Heroku to install only Chromium's dependencies.
Put this buildpack before the Node.js buildpack in Heroku.
With this trick you can use most of the stuff from puppeteer-stealth.
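For context, a minimal launch with playwright-chromium on this setup might look like the following sketch (the flags and the target URL are illustrative assumptions, not part of the buildpack docs):
const { chromium } = require('playwright-chromium');

(async () => {
  const browser = await chromium.launch({
    headless: true,
    args: ['--no-sandbox'], // commonly needed on Heroku dynos
  });
  const page = await browser.newPage();
  await page.goto('https://example.com');
  console.log(await page.title());
  await browser.close();
})();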
If you want, you can block resources like in Puppeteer:
await page.route('**/*', route => ([
  'stylesheet',
  'image',
  'media',
  'font',
  // 'script',
  'texttrack',
  'xhr',
  'fetch',
  'eventsource',
  'websocket',
  'manifest',
  'other',
].includes(route.request().resourceType()) ? route.abort() : route.continue()))
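And if you depend on puppeteer-extra-plugin-stealth, the playwright-extra package (from the same author as puppeteer-extra) exposes a similar .use() API. A rough sketch, assuming playwright-extra is installed and resolves your Playwright package (check its README for the exact compatibility notes):
const { chromium } = require('playwright-extra');
const stealth = require('puppeteer-extra-plugin-stealth')();

// Register the stealth evasions with the Playwright chromium launcher.
chromium.use(stealth);

(async () => {
  const browser = await chromium.launch({ headless: true, args: ['--no-sandbox'] });
  const page = await browser.newPage();
  await page.goto('https://bot.sannysoft.com'); // illustrative target for checking evasions
  await page.screenshot({ path: 'stealth-check.png', fullPage: true });
  await browser.close();
})();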

Related

Cloud Functions Puppeteer cannot open browser

My setup in GCF:
run npm install --save puppeteer from the project's Cloud Shell
edit package.json like so:
{ "dependencies": { "puppeteer": "^19.2.2" } }
paste code from medium.com into index.js:
https://gist.githubusercontent.com/Alezco/b9b7ce4ec7ee7f208818e395225fcbbe/raw/8554acc8b311a10e272f5d1b98dce3400945bb00/index.js
deploy with 2 GB RAM, 0-3 instances, max 500s timeout
I get these errors after building or opening the URL:
Internal Server Error
Could not find Chromium (rev. 1056772). This can occur if either 1. you did not perform an installation before running the script (e.g. npm install) or 2. your cache path is incorrectly configured (which is: /workspace/.cache/puppeteer). For (2), check out our guide on configuring puppeteer at https://pptr.dev/guides/configuration.
When I run npm list, both webdriver and puppeteer are installed. I suspect there is an issue with this path but I cannot figure out where it should lead.
I could then provide puppeteer.launch() with the executablePath argument, which might solve the problem.
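(For illustration, that would look something like the snippet below; the path is a placeholder, not the actual Cloud Functions location.)
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({
    // Placeholder path: point this at wherever Chromium actually ends up inside the image.
    executablePath: '/workspace/.cache/puppeteer/chrome/linux-1056772/chrome-linux/chrome',
    args: ['--no-sandbox'],
  });
  // ... scraping logic ...
  await browser.close();
})();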
I tried reinstalling puppeteer and changing configuration. No luck.
In addition to adding a .puppeteerrc.cjs per Kristofer's answer, I added a postinstall script in my package.json:
"scripts": {
...
"postinstall": "node node_modules/puppeteer/install.js"
},
This fixed the problem and I was able to deploy my Google Cloud Function. This is a temporary fix until issue #9128 is fixed.
I had the exact same issue and it seems to be related to this https://github.com/puppeteer/puppeteer/issues/9128
I'm using Firebase and don't have complete control over the build process when I deploy my functions but I still have access to the build logs. From the issue above, I realized I needed to handle the cache directory and the NPM version for this to work.
As far as I can tell, the problem is that the build step installs the Chrome browser needed for Puppeteer in a cache directory outside of the final image that is used for the actual function. In that context the error message makes more sense: it can't find the browser, therefore it doesn't work.
I was using Node 14 in my cloud functions which used NPM 6.14.17 in the build steps. According to the issue you need to use NPM > 7 so I upgraded my function to use Node 16.
Then I added the .puppeteerrc.cjs from https://pptr.dev/guides/configuration/#examples. When testing that locally, it adds a .cache directory where the Chrome installation is. This has to be ignored when deploying the cloud function or the deploy will fail due to size.
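For reference, the example from that guide is roughly the following .puppeteerrc.cjs, which keeps the Chromium download inside the project directory:
const { join } = require('path');

/**
 * @type {import("puppeteer").Configuration}
 */
module.exports = {
  // Download Chromium into <project>/.cache/puppeteer instead of the home directory.
  cacheDirectory: join(__dirname, '.cache', 'puppeteer'),
};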
In the firebase.json add:
"functions": {
"ignore": [
".cache"
]
},
The last step is pretty specific for Firebase and I'm not sure how this applies to your build steps etc. But this solved my issue that had the exact same error messages as you had. So double check the following:
NPM version in the build step needs to be at least v7 - Node 16 should do this.
Cache directory specified in the .puppeteerrc.cjs
It also looks like you're using an old Puppeteer version; I used 19.3.
I got this same (very misleading) error in my Puppeteer project which I run in a Google Cloud Function. The issue was that the Function was finishing (exiting) before the async Puppeteer script had finished.
I resolved this issue by changing the "await browser.close();" to a Promise and creating the response message in its .then().
... only to hit the next problem. My script is not downloading the CSV file as expected. It works locally though...
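A rough sketch of that close-before-respond pattern in an HTTP Cloud Function (the handler shape and names are illustrative, not the original code):
const puppeteer = require('puppeteer');

exports.scrape = async (req, res) => {
  const browser = await puppeteer.launch({ args: ['--no-sandbox'] });
  const page = await browser.newPage();
  await page.goto('https://example.com');
  const title = await page.title();

  // Build the response only after the browser has fully closed, so the
  // function does not exit while Puppeteer is still cleaning up.
  return browser.close().then(() => res.status(200).send({ title }));
};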

ChromeNotInstalledError while using chrome-launcher npm package

I am using chrome-launcher for running Lighthouse programmatically. It works fine locally, but when I run it on Azure I get an error.
On this statement const chrome = await chromeLauncher.launch({chromeFlags: ['--headless']}); I am getting the following error:
ChromeNotInstalledError
    at new LauncherError (C:\home\site\wwwroot\node_modules\chrome-launcher\dist\utils.js:37:22)
    at new ChromeNotInstalledError (C:\home\site\wwwroot\node_modules\chrome-launcher\dist\utils.js:68:9)
{ message: 'No Chrome installations found.', code: 'ERR_LAUNCHER_NOT_INSTALLED' }
How can I solve this?
You need to install Chrome on the Azure Function app somehow.
One way to do this is by using an npm dependency that installs Chrome as part of its install process. Examples of this are puppeteer and playwright. Although then you end up with some unnecessary dependencies.
You could also have a startup script or something that installs Chrome before running chrome-launcher/lighthouse. You'll need to tell chrome-launcher where Chrome is installed, if it's not in a standard place, using the chromePath option or the CHROME_PATH environment variable.
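For illustration, pointing chrome-launcher at a non-standard install could look like this (where Chrome actually lives depends on your startup script; the env-variable fallback is just one option):
const chromeLauncher = require('chrome-launcher');

(async () => {
  const chrome = await chromeLauncher.launch({
    chromeFlags: ['--headless'],
    // Placeholder: wherever your startup script installed Chrome,
    // or rely on the CHROME_PATH environment variable instead.
    chromePath: process.env.CHROME_PATH,
  });
  console.log(`Chrome is listening on port ${chrome.port}`);
  await chrome.kill();
})();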
You also have to ensure you do a Remote Build for the Function app.
You will also run into this error, which has a possible workaround: https://github.com/GoogleChrome/chrome-launcher/issues/188
Overall it's not easy. I actually ended up moving my workflow to GitHub Actions instead as Chrome is already installed on their runner images.
See:
https://anthonychu.ca/post/azure-functions-headless-chromium-puppeteer-playwright/

Failed to launch the browser process on Heroku

I built an app using Node, Express and sulla (which imports Puppeteer) at its core.
Basically I scrape some data and use sulla to send it via WhatsApp.
It works fine locally, but when I deploy it on Heroku I'm faced with this issue:
Failed to launch the browser
process!\n[0601/222716.792459:FATAL:zygote_host_impl_linux.cc(116)] No
usable sandbox! Update your kernel or see
https://chromium.googlesource.com/chromium/src/+/master/docs/linux_suid_sandbox_development.md
for more information on developing with the SUID sandbox. If you want
to live dangerously and need an immediate workaround, you can try
using --no-sandbox ...... Core file will not be generated.
TROUBLESHOOTING:
https://github.com/puppeteer/puppeteer/blob/master/docs/troubleshooting.md
I've already added the following buildpacks to my Heroku app:
https://github.com/jontewks/puppeteer-heroku-buildpack.git
heroku/nodejs
https://github.com/heroku/heroku-buildpack-chromedriver
I've seen solutions like https://stackoverflow.com/a/52228855, but I can't apply them since I'm not directly using Puppeteer. I've also cleared the Heroku caches without success.
As long as you are using the current npm package of sulla, it unfortunately won't work for you on Heroku. As the linked question says, you need to launch Puppeteer with --no-sandbox (the --disable-setuid-sandbox arg is not mandatory for Heroku):
await puppeteer.launch({ args: ['--no-sandbox'] })
Sulla lacks this arg in the npm package (launch) (current launch config) (not used launch config that would work with Heroku).
It is very good that you've already added the buildpacks, those are needed if puppeteer is running as a dependency in the background.
I.) You could try a fork of sulla, called sulla-hotfix: https://www.npmjs.com/package/@jprustv/sulla-hotfix if it suits your needs. This one still uses the previous sulla Puppeteer config, which apparently contains the --no-sandbox launch arg.
It is true for the original project sulla was forked from: @open-wa/wa-automate. It may work on Heroku with the buildpacks.
II.) Or you could publish a modified version of sulla under MIT license, containing the right launch parameter.

Problems with webpack build, using angular-cli

I am using angular-cli with the built-in webpack and have encountered the following problem: if I use ng serve --host 0.0.0.0 --port 3000, everything works fine, but when I build the app (no matter whether in development or production mode) with ng build and then put it behind my nginx, not a single route works, and every attempt by the browser to download an image used on the start page ends with a 404 error. What am I doing wrong? I've googled a ton of stuff and nothing seems to be a solution.
Some additional info:
angular-cli: 1.0.0-beta.21
node: 6.9.4
os: win32 x64
So for everyone who encounters this problem, after some research I found this:
1) To solve the problem with styles not loading properly, add encapsulation: ViewEncapsulation.None to each of your components' @Component() decorators; this way it's going to bundle styles properly. More info here: https://angular.io/docs/ts/latest/guide/component-styles.html#!#loading-styles
2) Make sure all your images are stored inside the assets folder. If you store them in some other folder like images, just put that images folder inside assets.
3) And for information about paths not working on page reload, have a look at this question and read some information about strategies: Angular 2 : 404 error occur when i refresh through Browser
Also this link has config examples https://github.com/angular-ui/ui-router/wiki/Frequently-Asked-Questions#how-to-configure-your-server-to-work-with-html5mode

Azure and node js __dirname

Probably it is not specifically related to webpack/memory-fs, but I am getting the RangeError: Maximum call stack size exceeded error (see below for a call stack).
I have found out, that __dirname on Azure (webapp) returns \\100.78.172.13\volume-7-default\8f5ecde749dace2bb57a\4e07195f015b45ce8e9ba255dc901988\site\repository\Source\Website\Content\app\node_modules\webpack\node_modules\memory-fs\lib\normalize.js in my situation, while process.cwd() returns D:\home\site\repository\Source\Website\Content\app.
Is there anything that can be done on my side to configure Node.js to return D:\... instead of \\...?
Gist
How to reproduce:
Clone the https://github.com/intellismiths/webapp1 repository.
Create new Azure Web App (default settings).
Configure deployment source to use GitHub.
Click Sync. It will take 10+ minutes to complete and it will show that the deployment was successful.
Go to Application settings in Azure and change WEBSITE_NODE_DEFAULT_VERSION to 6.2.2
Go to kudu page and open powershell console.
Execute npm cache clean
Check node version by executing node -v. It should be v6.2.2
On Azure, navigate to D:\home\site\repository\src\WebApp1
Execute npm run build
In the console, you should see a lot of errors indicating that modules cannot be resolved.
OPTIONAL. Test npm run build on your local machine - it should produce wwwroot/app.js without errors.
Update webpack.config.js to include context: __dirname to fix the previous errors (sketched just after this list).
Execute npm run build
In console, you should see the "RangeError: Maximum call stack size exceeded" error.
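For reference, that context: __dirname tweak amounts to something like this in webpack.config.js (the entry and output values are placeholders, not the repository's actual config):
const path = require('path');

module.exports = {
  // Pin the build context explicitly so entries and modules resolve
  // relative to this config file.
  context: __dirname,
  entry: './src/main.js',                      // placeholder entry point
  output: {
    path: path.resolve(__dirname, 'wwwroot'),  // the build is expected to emit wwwroot/app.js
    filename: 'app.js',
  },
};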
Update 1
I only tried to set 6.2.2 runtime after adding the second package.json, so the project structure is not the simplest possible. Maybe just setting node to 6.2.2 breaks the build.
I could reproduce your issue following your steps. I found the key point was setting the WEBSITE_NODE_DEFAULT_VERSION to 6.2.2. And I found the webpack task worked fine if the WEBSITE_NODE_DEFAULT_VERSION was under 6.
Please downgrade the WEBSITE_NODE_DEFAULT_VERSION setting to a version under 6, e.g. 5.9.0, if your Node.js modules do not need such a high version.
And according to the package.json of angular2 at https://github.com/angular/angular/blob/master/package.json, it seems that the angular2 repository requires a Node.js version between 5.4 and 6.
Additionally, the web application's root directory on Azure Web Apps is D:\home\site\wwwroot. So if you want to build your frontend project on Azure Web Apps, you need to locate to D:\home\site\wwwroot\wwwroot\mobile-web-app then run npm run build.
It's been fixed in master and it's proposed to be included in v6.4.0.
See: https://github.com/nodejs/node/issues/7175#issuecomment-239824532 and https://github.com/nodejs/node/pull/8070
After a long day of research, trial-and-error and various experimentation, I've found an acceptable workaround if you're not willing to downgrade to Node 5.*:
Downgrade to Node 6.1.0
Make sure to install webpack globally (with npm install -g webpack).
Just using 6.1.0 gets around the "maximum call stack size exceeded" error, but instead gave me a lot of resolve failures when running webpack from node_modules (using ./node_modules/.bin/webpack). Installing webpack globally finally got me past that.
If I understand it correctly, this whole issue with __dirname in Node >= 6.2 resolving to the UNC folder path instead of the mounted path is going to be fixed; there's an active discussion here.
I had the same issue.
Fixed it by UPGRADING npm, not DOWNGRADING.
The bug is fixed in npm versions newer than 6.5.
https://github.com/aumanjoa/chronas-community/blob/master/package.json#L48
I believe that your __dirname shows the persistent drive where the data is stored, while process.cwd() gives the current directory from which node ran. This is because Azure runs from the drive but files are stored on the persistent drive.
In your Gruntfile.js add
module.exports = function (grunt) {
  // Make grunt resolve paths relative to the Gruntfile's directory.
  grunt.file.setBase(__dirname);
  // Code omitted
}
