How to kill old Puppeteer browser if still running? - node.js

I am trying to scrape data from several websites using a single Puppeteer instance. I don't want to launch a new browser for each website, so I need to check whether a browser has already been launched and, if so, just open a new tab. These are the checks I run before launching a browser:
const browser = await puppeteer.launch();
browser?.isConnected()
browser.process() // the spawned child process; null only if the instance came from puppeteer.connect()
Still, my script sometimes launches a new browser even though an old one is already running. So I am thinking of killing any old browser before launching, or is there a better check? Any other suggestions would be highly appreciated.

I'm not sure whether closing existing browsers can be done through Puppeteer's API, but I can recommend how people usually handle this situation: make sure the browser instance is closed even if an error occurs:
let browser = null;
try {
  browser = await puppeteer.launch({
    headless: true,
    args: ['--no-sandbox'],
  });
  const page = await browser.newPage();
  const url = req.query.url; // declared locally instead of leaking a global
  await page.goto(url);
  const bodyHTML = await page.evaluate(() => document.body.innerHTML);
  res.send(bodyHTML);
} catch (e) {
  console.log(e);
} finally {
  // always release the browser, even after an error
  if (browser) {
    await browser.close();
  }
}
Otherwise, you can use shell-based commands like kill or pkill if you have access to the process ID of the previous browser (for a launched browser, browser.process().pid).
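If browser.close() hangs or leaves a process behind, one option is to fall back to killing the child process that Puppeteer spawned. This is only a sketch, assuming the browser was started with puppeteer.launch() so that browser.process() returns a Node.js ChildProcess; closeOrKill and the 5-second default are hypothetical choices made for illustration, not part of Puppeteer.

```javascript
// Try a graceful close first; if it does not finish in time, kill the process.
// `closeOrKill` is a hypothetical helper, not part of Puppeteer's API.
async function closeOrKill(browser, timeoutMs = 5000) {
  let timer;
  const timedOut = new Promise((resolve) => {
    timer = setTimeout(() => resolve('timeout'), timeoutMs);
  });
  try {
    const result = await Promise.race([
      browser.close().then(() => 'closed'),
      timedOut,
    ]);
    if (result === 'closed') return 'closed';
  } catch (err) {
    // close() itself failed; fall through to the hard kill below.
  } finally {
    clearTimeout(timer);
  }
  // process() is null when the instance came from puppeteer.connect().
  const child = browser.process();
  if (child) {
    child.kill('SIGKILL');
    return 'killed';
  }
  return 'no-process';
}
```

Calling await closeOrKill(browser) in a finally block keeps the normal path graceful and only escalates when the close does not complete.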

The most reliable way I've found to close a Puppeteer instance is to close all of the pages within a BrowserContext, which automatically closes the Browser. I've seen Chromium instances linger in Task Manager after calling just await browser.close().
Here is how I do this:
const openAndCloseBrowser = async () => {
  const browser = await puppeteer.launch();
  try {
    // your logic
  } catch (error) {
    // error handling
  } finally {
    // close every open page instead of calling browser.close()
    const pages = await browser.pages();
    for (const page of pages) await page.close();
  }
};
If you try running await browser.close() after running the loop and closing each page individually, you should see an error stating that the browser was already closed and your Task Manager should not have lingering chromium instances.
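For the original goal (one browser, many tabs), a common pattern is to cache a single launch and relaunch only after that browser has disconnected. Here is a minimal sketch, assuming the caller passes in a launch function such as () => puppeteer.launch(); getBrowser and openTab are hypothetical names, and the only Puppeteer features relied on are the browser's 'disconnected' event, browser.newPage() and page.goto().

```javascript
// Cache the launch promise so concurrent callers share one browser.
let browserPromise = null;

function getBrowser(launch) {
  if (!browserPromise) {
    browserPromise = launch().then((browser) => {
      // Forget the cached instance once it closes or crashes.
      browser.once('disconnected', () => { browserPromise = null; });
      return browser;
    });
    // If the launch itself fails, allow a later call to retry.
    browserPromise.catch(() => { browserPromise = null; });
  }
  return browserPromise;
}

// Open a new tab in the shared browser instead of launching another one.
async function openTab(launch, url) {
  const browser = await getBrowser(launch);
  const page = await browser.newPage();
  await page.goto(url);
  return page;
}
```

Caching the promise rather than the browser object matters when two scrapes start at nearly the same time: both await the same launch instead of each starting their own.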

Related

How to get lastModified property of another website

When I use the inspect/developer tool in chrome I can find the last modified date from browser but I want to see the same date in my nodeJS application.
I have already tried
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://www.tbsnews.net/economy/bsec-chairman-stresses-restoring-investor-confidence-mutual-funds-500126');
const names = page.evaluate( ()=> {
console.log(document.lastModified);
})
Unfortunately, this code shows the time at which the new DOM was created, since we are using newPage(). Can anyone help me?
I have also tried JSDOM.
Thanks in advance.

Webscraping TimeoutError: Navigation timeout of 30000 ms exceeded

I'm trying to extract some table from a company website using puppeteer.
But I don't understand why the browser opens Chromium instead of my default Chrome, which then leads to "TimeoutError: Navigation timeout of 30000 ms exceeded" and doesn't leave me enough time to use a CSS selector. I can't find any documentation about this.
const puppeteer = require('puppeteer');
(async () => {
const browser = await puppeteer.launch({headless: false});
const page = await browser.newPage()
await page.goto('https://www....com');
//search term
await page.type("#search_term","Brazil");
//await page.screenshot({path: 'sc2.png'});
//await browser.close();
})();
Puppeteer is Chromium-based by default.
If you wish to use Chrome instead, you have to specify the executable path through the executablePath launch option. But to be honest, most of the time there is no point in doing so.
let browser = await puppeteer.launch({
executablePath: `/path/to/Chrome`,
//...
});
There is no correlation between TimeoutError: Navigation timeout of 30000 ms exceeded and the use of Chromium; it is more likely that your target URL isn't (yet) available.
page.goto will throw an error if:
there's an SSL error (e.g. in case of self-signed certificates).
target URL is invalid.
the timeout is exceeded during navigation.
the remote server does not respond or is unreachable.
the main resource failed to load.
By default, the maximum navigation timeout is 30 seconds. If for some reason your target URL requires more time to load (which seems unlikely), you can change the limit with the timeout option; timeout: 0 disables the timeout entirely.
await page.goto(`https://github.com/`, {timeout: 0});
Note that Puppeteer will not throw an error just because an HTTP error status code is returned. From the documentation:
page.goto will not throw an error when any valid HTTP status code is returned by the remote server, including 404 "Not Found" and 500 "Internal Server Error".
I usually check the HTTP response status code to make sure I'm not hitting a client error response such as 404 Not Found.
const response = await page.goto(`https://github.com/`);
const status = response.status();
if (status !== 404) {
  console.log(`Probably HTTP response status code 200 OK.`);
  //...
}
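The status check can be made a little stricter by treating anything outside the 2xx range as a failure. A rough sketch, assuming a Puppeteer page object is passed in; gotoOrThrow is a hypothetical helper name, while response.ok() and response.status() are the response methods it relies on.

```javascript
// Navigate and fail loudly on non-2xx responses.
// `gotoOrThrow` is a hypothetical helper name.
async function gotoOrThrow(page, url, options = {}) {
  const response = await page.goto(url, options);
  // goto() can resolve to null, e.g. when navigating to about:blank
  // or to the same URL with a different hash.
  if (response === null) return null;
  if (!response.ok()) {
    throw new Error(`Unexpected HTTP status ${response.status()} for ${url}`);
  }
  return response;
}
```

This way a 404 or 500 surfaces as an exception at the navigation step rather than later as a missing selector.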
I'm flying blind here as I don't have your target url nor more information on what you're trying to accomplish.
You should also give the puppeteer api documentation a read.
The approach below works for me. Try adding the following one-liner to your code.
The setDefaultNavigationTimeout method sets the navigation timeout for the tab and takes the value in milliseconds as its first argument. A value of 0 means an unlimited amount of time, which is acceptable here since I know my page will load eventually.
const puppeteer = require('puppeteer');
(async () => {
const browser = await puppeteer.launch({headless: false});
const page = await browser.newPage()
// Add the below 1 line of code
await page.setDefaultNavigationTimeout(0);
// follows the rest of your code block
})();
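Disabling the timeout entirely can leave a script waiting forever if the site never responds. One alternative, sketched here under assumptions, is a longer finite timeout combined with a bounded number of retries. gotoWithRetry, the 60-second default and the three attempts are hypothetical choices; timeout and waitUntil: 'domcontentloaded' are regular page.goto options.

```javascript
// Retry a navigation a few times with a finite timeout per attempt.
// `gotoWithRetry` is a hypothetical helper name.
async function gotoWithRetry(page, url, { timeout = 60000, attempts = 3 } = {}) {
  let lastError;
  for (let i = 0; i < attempts; i += 1) {
    try {
      // Resolve as soon as the DOM is parsed instead of waiting for every resource.
      return await page.goto(url, { timeout, waitUntil: 'domcontentloaded' });
    } catch (err) {
      lastError = err; // remember the failure and try again
    }
  }
  throw lastError;
}
```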

Simple way to add Firefox Extensions/Add Ons

I know with Pyppeteer (Puppeteer) or Selenium, I can simply add chrome/chromium extensions by including them in args like this:
args=[
f'--disable-extensions-except={pathToExtension}',
f'--load-extension={pathToExtension}'
]
I also know the selenium has the very useful load_extension fx.
I was wondering if there was a similarly easy way to load extensions/add ons in firefox for Playwright? Or perhaps just with the firefox_user_args
I've seen an example in JS using this:
const path = require('path');
const {firefox} = require('playwright');
const webExt = require('web-ext').default;
(async () => {
// 1. Enable verbose logging and start capturing logs.
webExt.util.logger.consoleStream.makeVerbose();
webExt.util.logger.consoleStream.startCapturing();
// 2. Launch firefox
const runner = await webExt.cmd.run({
sourceDir: path.join(__dirname, 'webextension'),
firefox: firefox.executablePath(),
args: [`-juggler=1234`],
}, {
shouldExitProgram: false,
});
// 3. Parse firefox logs and extract juggler endpoint.
const JUGGLER_MESSAGE = `Juggler listening on`;
const message = webExt.util.logger.consoleStream.capturedMessages.find(msg => msg.includes(JUGGLER_MESSAGE));
const wsEndpoint = message.split(JUGGLER_MESSAGE).pop();
// 4. Connect playwright and start driving browser.
const browser = await firefox.connect({ wsEndpoint });
const page = await browser.newPage();
await page.goto('https://mozilla.org');
// .... go on driving ....
})();
Is there anything similar for python?
TL;DR: code at the end.
After spending too much time on this, I have found a way to install extensions in Firefox with Playwright, a feature that I don't think will be supported for now, since Chromium already has it and it works.
Since adding an extension in Firefox requires the user to click a special popup that appears when you click to install the extension, I figured it was easier to download the .xpi file and install the extension from that file.
To install a file as a temporary extension, we need to go to the URL 'about:debugging#/runtime/this-firefox'.
But on that page you cannot use the console or the DOM because of Firefox protections that I haven't been able to bypass.
However, we know that about:debugging runs in a special tab id, so we can open a new tab at 'about:devtools-toolbox', where we can fake user input to run commands in a GUI console.
To install the file, the code has to load it as an 'nsIFile'. To do that, we make use of the modules already loaded in 'about:debugging' and require the ones we need.
The following code is Python, but translating it into JavaScript should be no big deal.
import asyncio
import os

# Hypothetical wrapper name: the snippet is the body of an async function
# that receives a Playwright Firefox `browser`.
async def install_extensions(browser):
    # get the absolute path for all the xpi extensions
    extensions = [os.path.abspath(f"Settings/Addons/{file}") for file in os.listdir("Settings/Addons") if file.endswith(".xpi")]
    if not extensions:
        return
    # JavaScript commands typed into the devtools console
    c1 = "const { AddonManager } = require('resource://gre/modules/AddonManager.jsm');"
    c2 = "const { FileUtils } = require('resource://gre/modules/FileUtils.jsm');"
    c3 = "AddonManager.installTemporaryAddon(new FileUtils.File('{}'));"
    context = await browser.new_context()
    page = await context.new_page()
    page2 = await context.new_page()
    await page.goto("about:debugging#/runtime/this-firefox", wait_until="domcontentloaded")
    await page2.goto("about:devtools-toolbox?id=9&type=tab", wait_until="domcontentloaded")
    await asyncio.sleep(1)
    # keyboard navigation to reach the console input in the toolbox
    await page2.keyboard.press("Tab")
    await page2.keyboard.down("Shift")
    await page2.keyboard.press("Tab")
    await page2.keyboard.press("Tab")
    await page2.keyboard.up("Shift")
    await page2.keyboard.press("ArrowRight")
    await page2.keyboard.press("Enter")
    await page2.keyboard.type(f"{' '*10}{c1}{c2}")
    await page2.keyboard.press("Enter")
    for extension in extensions:
        print(f"Adding extension: {extension}")
        await asyncio.sleep(0.2)
        await page2.keyboard.type(f"{' '*10}{c3.format(extension)}")
        await page2.keyboard.press("Enter")
        # await asyncio.sleep(0.2)
        await page2.bring_to_front()
Note that there are some sleeps because the page needs to load, but Playwright cannot detect when it has.
I needed to add some leading whitespace because, for some reason, Playwright or Firefox was dropping some of the first characters of the commands.
Also, if you want to install more than one add-on, I suggest tuning the sleep before bringing the page to the front, in case the add-on opens a new tab.

When web scraping/testing how do I get past the notifications popup?

Goal: I'm trying to scrape pictures from instagram using Puppeteer to programmatically log in to my account and start mining data.
The issue: I can log in fine but then I get hit with a popup asking if I want notifications (I turned headless off to see this in action). I'm following the example code for this found here: https://github.com/checkly/puppeteer-examples/blob/master/3.%20login/instagram.js which uses the below try block to find the notification popup and click the 'Not now' button.
//check if the app asks for notifications
try {
await loginPage.waitForSelector(".aOOlW.HoLwm",{
timeout:5000
});
await loginPage.click(".aOOlW.HoLwm");
} catch (err) {
}
The problem is it doesn't actually click the 'Not now' button so my script is stuck in limbo. The selector is pointing to the right div so what gives?
Can you try granting the "notifications" permission using browserContext.overridePermissions?
You can override the notification permission. This is the code that would, for example, suppress the Allow Notifications popup when logging into Facebook.
let crawl = async function () {
  let browser = await puppeteer.launch({ headless: false });
  const context = browser.defaultBrowserContext();
  // arguments: the origin URL, then an array of permissions to grant
  await context.overridePermissions("https://www.facebook.com", ["geolocation", "notifications"]);
  let page = await browser.newPage();
  await page.goto("https://www.facebook.com");
  await page.type("#email", process.argv[2]);
  await page.type("#pass", process.argv[3]);
  await page.click("#u_0_2");
  // waitForSelector covers the wait; the fixed page.waitFor(1000) delay is
  // dropped because newer Puppeteer versions no longer provide page.waitFor
  await page.waitForSelector("#pagelet_composer");
  let content2 = await page.$$("#pagelet_composer");
  console.log(content2); // .$$ returns an array of elementHandles; .$ would return a single elementHandle
};
crawl();
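The same idea can be wrapped so that the permission is granted for whichever site is being opened. A small sketch, assuming a Puppeteer browser object is passed in; openWithPermissions is a hypothetical name, and the origin is derived from the URL because overridePermissions is called with an origin rather than a full page URL.

```javascript
// Grant permissions for a site's origin before opening it.
// `openWithPermissions` is a hypothetical helper name.
async function openWithPermissions(browser, url, permissions = ['notifications']) {
  const context = browser.defaultBrowserContext();
  // Reduce the full URL to scheme + host, e.g. https://www.instagram.com
  const origin = new URL(url).origin;
  await context.overridePermissions(origin, permissions);
  const page = await browser.newPage();
  await page.goto(url);
  return page;
}
```

Whether this helps depends on the popup being the browser's own permission prompt; a dialog rendered by the site itself would still need to be clicked.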

Puppeteer: How to handle multiple tabs?

Scenario: Web form for developer app registration with two part workflow.
Page 1: Fill out developer app details and click on button to create Application ID, which opens, in a new tab...
Page 2: The App ID page. I need to copy the App ID from this page, then close the tab and go back to Page 1 and fill in the App ID (saved from Page 2), then submit the form.
I understand basic usage - how to open Page 1 and click the button which opens Page 2 - but how do I get a handle on Page 2 when it opens in a new tab?
Example:
const puppeteer = require('puppeteer');
(async() => {
const browser = await puppeteer.launch({headless: false, executablePath: '/Applications/Google Chrome.app'});
const page = await browser.newPage();
// go to the new bot registration page
await page.goto('https://register.example.com/new', {waitUntil: 'networkidle'});
// fill in the form info
const form = await page.$('new-app-form');
await page.focus('#input-appName');
await page.type('App name here');
await page.focus('#input-appDescription');
await page.type('short description of app here');
await page.click('.get-appId'); //opens new tab with Page 2
// handle Page 2
// get appID from Page 2
// close Page 2
// go back to Page 1
await page.focus('#input-appId');
await page.type(appIdSavedFromPage2);
// submit the form
await form.evaluate(form => form.submit());
browser.close();
})();
Update 2017-10-25
The work for Browser.pages has been completed and merged.
It fixes "Emit new Page objects when new tabs created" (#386) and "Request: browser.currentPage() or similar way to access Pages" (#443).
Still looking for a good usage example.
A new patch was committed two days ago, and you can now use browser.pages() to access all pages in the current browser.
Works fine, I tried it myself yesterday :)
Edit:
An example of how to get the JSON value of a new page opened via a 'target: _blank' link.
const page = await browser.newPage();
await page.goto(url, {waitUntil: 'load'});
// click on a 'target:_blank' link
await page.click(someATag);
// get all the currently open pages as an array
let pages = await browser.pages();
// take the last element of the array (the newly opened tab) and do some
// hocus-pocus to get it as JSON...
const newTab = pages[pages.length - 1];
const aHandle = await newTab.evaluateHandle(() => document.body);
const resultHandle = await newTab.evaluateHandle(body =>
  body.innerHTML, aHandle);
// get the JSON value of the page.
let jsonValue = await resultHandle.jsonValue();
// ...do something with JSON
This will work for you in the latest alpha branch:
const newPagePromise = new Promise(x => browser.once('targetcreated', target => x(target.page())));
await page.click('my-link');
// handle Page 2: you can access the new page's DOM through the newPage object
const newPage = await newPagePromise;
await newPage.waitForSelector('#appid');
// query the new tab, not the original page
const appidHandle = await newPage.$('#appid');
const appID = await newPage.evaluate(element => element.innerHTML, appidHandle);
await newPage.close();
[...]
//back to page 1 interactions
Be sure to use the latest Puppeteer version (from the GitHub master branch) by setting the package.json dependency to
"dependencies": {
"puppeteer": "git://github.com/GoogleChrome/puppeteer"
},
Source: JoelEinbinder # https://github.com/GoogleChrome/puppeteer/issues/386#issuecomment-343059315
According to the Official Documentation:
browser.pages()
returns: <Promise<Array<Page>>> Promise which resolves to an array of all open pages. Non visible pages, such as "background_page", will not be listed here. You can find them using target.page().
An array of all pages inside the Browser. In case of multiple browser contexts, the method will return an array with all the pages in all browser contexts.
Example Usage:
let pages = await browser.pages();
await pages[0].evaluate(() => { /* ... */ });
await pages[1].evaluate(() => { /* ... */ });
await pages[2].evaluate(() => { /* ... */ });
In theory, you could override the window.open function to always open "new tabs" on your current page and navigate via history.
Your workflow would then be:
Override the window.open function:
await page.evaluateOnNewDocument(() => {
window.open = (url) => {
top.location = url
}
})
Go to your first page and perform some actions:
await page.goto(PAGE1_URL)
// ... do stuff on page 1
Navigate to your second page by clicking the button and perform some actions there:
// start waiting before the click so a fast navigation is not missed
await Promise.all([
  page.waitForNavigation(),
  page.click('#button_that_opens_page_2'),
])
// ... do stuff on page 2, extract any info required on page 1
// e.g. const handle = await page.evaluate(() => { ... })
Return to your first page:
await page.goBack()
// or: await page.goto(PAGE1_URL)
// ... do stuff on page 1, injecting info saved from page 2
This approach obviously has its drawbacks, but I find it simplifies multi-tab navigation drastically, which is especially useful if you're already running parallel jobs on multiple tabs. Unfortunately, the current API doesn't make it an easy task.
You can remove the need to switch pages, when the new tab is caused by a target="_blank" attribute, by setting target="_self" instead.
Example:
const element = await page.$(selector);
await page.evaluateHandle((el) => {
  el.target = '_self';
}, element);
await element.click();
If your click action triggers a page load, any scripts that run afterwards are effectively lost. To get around this, trigger the action (a click in this case) but don't await it. Instead, wait for the page load:
page.click('.get-appId');
await page.waitForNavigation();
This will allow your script to effectively wait for the next pageload event before proceeding with further actions.
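A variation that avoids leaving the click promise unhandled is to start the wait and the click together with Promise.all. This is a sketch rather than the answer's exact method; clickAndWaitForNavigation is a hypothetical helper name, and page is assumed to be a Puppeteer page.

```javascript
// Start listening for the navigation before the click is dispatched,
// then wait for both to finish.
// `clickAndWaitForNavigation` is a hypothetical helper name.
async function clickAndWaitForNavigation(page, selector, options = {}) {
  const [response] = await Promise.all([
    page.waitForNavigation(options), // registered first
    page.click(selector),
  ]);
  return response;
}
```

With this shape, an error from either the click or the navigation rejects the awaited promise, so it can be handled in one try/catch.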
You can't currently - Follow https://github.com/GoogleChrome/puppeteer/issues/386 to know when the ability is added to puppeteer (hopefully soon)
It looks like there's a simple 'popup' event on the page:
Page corresponding to "popup" window
Emitted when the page opens a new tab or window.
const [popup] = await Promise.all([
new Promise(resolve => page.once('popup', resolve)),
page.click('a[target=_blank]'),
]);
const [popup] = await Promise.all([
new Promise(resolve => page.once('popup', resolve)),
page.evaluate(() => window.open('https://example.com')),
]);
Credit to this GitHub issue for an easier alternative to 'targetcreated'.
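Applied to the two-page workflow from the question, the 'popup' event can be used to grab a value from the new tab and then close it. A sketch under assumptions: readFromPopup is a hypothetical helper, and the selectors in the usage note are taken from the question and the earlier answer and may not match the real page.

```javascript
// Click a link that opens a new tab, read one element's text there,
// then close the tab and return the text.
// `readFromPopup` is a hypothetical helper name.
async function readFromPopup(page, linkSelector, valueSelector) {
  const [popup] = await Promise.all([
    new Promise((resolve) => page.once('popup', resolve)),
    page.click(linkSelector),
  ]);
  try {
    await popup.waitForSelector(valueSelector);
    // $eval runs the callback in the popup against the first match.
    return await popup.$eval(valueSelector, (el) => el.textContent.trim());
  } finally {
    // Close the new tab even if the lookup fails.
    await popup.close();
  }
}
```

Usage would look like const appId = await readFromPopup(page, '.get-appId', '#appid'); after which the script can continue filling in the form on the first page.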
You can call browser.newPage() multiple times to open multiple tabs.
Example:
const page = await browser.newPage();
await page.goto("https://www.google.com/");
const page2 = await browser.newPage();
await page2.goto("https://www.youtube.com/");
