When I use the inspect/developer tools in Chrome I can find the last-modified date of a page, but I want to get the same date in my Node.js application.
I have already tried:
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://www.tbsnews.net/economy/bsec-chairman-stresses-restoring-investor-confidence-mutual-funds-500126');
const lastModified = await page.evaluate(() => document.lastModified);
console.log(lastModified);
Unfortunately this code shows the current time at which the new DOM was created, since we are using newPage(). Can anyone help me?
I have also tried JSDOM.
Thanks in advance.
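As an aside, one alternative worth checking (not guaranteed to work, since many servers simply omit the header) is the Last-Modified HTTP response header rather than document.lastModified; a minimal sketch with puppeteer:
// Hedged sketch: read the Last-Modified response header, if the server sends one.
const browser = await puppeteer.launch();
const page = await browser.newPage();
const response = await page.goto('https://www.tbsnews.net/economy/bsec-chairman-stresses-restoring-investor-confidence-mutual-funds-500126');
console.log(response.headers()['last-modified']); // undefined if the header is absent
await browser.close();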
I am trying to scrape data from different websites using only one Puppeteer instance. I don't want to launch a new browser for each website, so I need to check whether a browser has already been launched and, if so, just open a new tab. I did something like the below; these are the conditions I always check before launching a browser:
const browser = await puppeteer.launch();
browser?.isConnected()
browser.process() // the spawned Chromium process, or null if the instance came from puppeteer.connect()
but still, I find that my script sometimes re-launches the browser even though an old one is already running. So I am thinking of killing any old browser that is still running, or is there a better check? Any other suggestions would be highly appreciated.
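For illustration, one rough way to reuse a single instance is a small module-level helper; a minimal sketch, where getBrowser and scrape are hypothetical names, not puppeteer APIs:
const puppeteer = require('puppeteer');

// Sketch only: keep one browser per process and only launch when none is connected.
let browserPromise = null;

async function getBrowser() {
  if (browserPromise) {
    const browser = await browserPromise;
    if (browser.isConnected()) return browser; // reuse the existing instance
    browserPromise = null; // the old browser died, launch a fresh one
  }
  browserPromise = puppeteer.launch();
  return browserPromise;
}

// Every caller shares the same browser and just opens a new tab.
async function scrape(url) {
  const browser = await getBrowser();
  const page = await browser.newPage();
  await page.goto(url);
  // ... scrape ...
  await page.close();
}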
I'm not sure whether that specific operation (closing existing browsers) can be done through Puppeteer's APIs, but what I can recommend is how people usually handle this situation, which is to make sure that the browser instance is closed if any issue is encountered:
let browser = null;
try {
  browser = await puppeteer.launch({
    headless: true,
    args: ['--no-sandbox'],
  });
  const page = await browser.newPage();
  const url = req.query.url; // e.g. inside an Express route handler
  await page.goto(url);
  const bodyHTML = await page.evaluate(() => document.body.innerHTML);
  res.send(bodyHTML);
} catch (e) {
  console.log(e);
} finally {
  if (browser) await browser.close();
}
Otherwise, you can use shell based commands like kill or pkill if you have access to the process ID of the previous browser.
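For the shell-free variant, a minimal sketch, assuming the browser was launched by this process (so browser.process() returns the Chromium child process rather than null):
// Hedged sketch: force-kill a previously launched Chromium if close() did not work.
const proc = browser.process();
if (proc && !proc.killed) {
  proc.kill('SIGKILL');
}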
The most reliable means of closing a puppeteer instance that I've found is to close all of the pages within a BrowserContext, which automatically closes the Browser. I've seen instances of chromium linger in Task Manager after calling just await browser.close().
Here is how I do this:
const openAndCloseBrowser = async () => {
  const browser = await puppeteer.launch();
  try {
    // your logic
  } catch (error) {
    // error handling
  } finally {
    const pages = await browser.pages();
    for (const page of pages) await page.close();
  }
};
If you try running await browser.close() after running the loop and closing each page individually, you should see an error stating that the browser was already closed and your Task Manager should not have lingering chromium instances.
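Usage is just awaiting the helper; a quick sketch, with the scraping logic going inside the try block above:
openAndCloseBrowser()
  .then(() => console.log('all pages closed, chromium should be gone from Task Manager'))
  .catch(console.error);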
I'm trying to scrape a press site, open every article link and get the data. I was able to scrape it with puppeteer but cannot upload the data to Firebase cloud storage. How do I do that every hour or so?
I scraped in an asynchronous function and then called it in the cloud function:
I used puppeteer to scrape the article links from the newsroom website and then used those links to get more information from the articles. I first had everything in a single async function, but Cloud Functions threw an error that there should not be any awaits in a loop.
UPDATE:
I implemented the code above in a Firebase function but still get the no-await-in-loop error.
There are a couple of things wrong here, but you are on a good path to getting this to work. The main thing to note is that the "no awaits in a loop" message comes from the no-await-in-loop lint rule rather than being a hard runtime error, and that error handling with async/await works differently from plain synchronous code. See: try/catch blocks with async/await.
In your case, it's totally fine to write everything in one async function. Here is how I would do it:
const puppeteer = require('puppeteer');

async function scrapeIfc() {
  const completeData = [];
  const url = 'https://www.ifc.org/wps/wcm/connect/news_ext_content/ifc_external_corporate_site/news+and+events/pressroom/press+releases';
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  page.setDefaultNavigationTimeout(0); // disable the navigation timeout before navigating
  await page.goto(url);
  const links = await page.evaluate(() =>
    Array.from(document.querySelectorAll('h3 > a')).map(anchor => anchor.href)
  );
  for (const link of links) {
    const newPage = await browser.newPage();
    await newPage.goto(link);
    const data = await newPage.evaluate(() => {
      const titleElement = document.querySelector('td[class="PressTitle"] > h3');
      const contactElement = document.querySelector('center > table > tbody > tr:nth-child(1) > td');
      const txtElement = document.querySelector('center > table > tbody > tr:nth-child(2) > td');
      return {
        source: 'IFC',
        title: titleElement ? titleElement.innerText : undefined,
        contact: contactElement ? contactElement.innerText : undefined,
        txt: txtElement ? txtElement.innerText : undefined,
      };
    });
    completeData.push(data);
    await newPage.close();
  }
  await browser.close();
  return completeData;
}
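If your linter still flags the for...of loop with the no-await-in-loop rule, a hedged alternative is to scrape the article pages concurrently with Promise.all. Note that this opens one tab per link at the same time, which may be heavy for long lists; scrapeArticle is an illustrative name and only one field is shown:
// Sketch only: replaces the sequential for...of loop inside scrapeIfc.
const scrapeArticle = async (link) => {
  const newPage = await browser.newPage();
  try {
    await newPage.goto(link);
    return await newPage.evaluate(() => {
      const titleElement = document.querySelector('td[class="PressTitle"] > h3');
      return {
        source: 'IFC',
        title: titleElement ? titleElement.innerText : undefined,
        // ...other fields as above
      };
    });
  } finally {
    await newPage.close();
  }
};

const completeData = await Promise.all(links.map(scrapeArticle));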
There are a couple of other things you should note:
You have a bunch of unused imports (title, link, resolve and reject) at the head of your script, which might have been added automatically by your code editor. Get rid of them, as they might shadow the real variables.
I changed your document.querySelectors to be more specific, as I couldn't select the actual elements from the IFC website. You might need to revise them.
For local development I use Google's functions-framework, which helps me run and test the function locally before deploying. If you have errors on your local machine, you'll have errors when deploying to Google Cloud.
(Opinion) If you don't need Firebase, I would run this with Google Cloud Functions, Cloud Scheduler and Cloud Firestore. For me, this has been the go-to workflow for periodic web scraping; see the scheduling sketch after these notes.
(Opinion) Puppeteer might be overkill for scraping a simple static website, since it runs in a headless browser. Something like Cheerio is much more lightweight and much faster.
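Since the original setup is on Firebase, here is a rough scheduling sketch using a scheduled Firebase function (scheduled functions are backed by Cloud Scheduler). It assumes the v1 firebase-functions API and that scrapeIfc above is available in the same module; the export name, schedule string and runtime options are illustrative only:
const functions = require('firebase-functions');
const admin = require('firebase-admin');
admin.initializeApp();

// Illustrative only: run the scraper every hour and write the results to Firestore.
exports.scrapePressReleases = functions
  .runWith({ memory: '1GB', timeoutSeconds: 300 }) // puppeteer needs extra memory and time
  .pubsub.schedule('every 60 minutes')
  .onRun(async () => {
    const articles = await scrapeIfc();
    const db = admin.firestore();
    const batch = db.batch();
    articles.forEach(article => batch.set(db.collection('articles').doc(), article));
    await batch.commit();
    return null;
  });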
Hope I could help. If you encounter other problems, let us know. Welcome to the Stack Overflow community!
For context, I am developing a synthetic monitoring tool using Node.js and puppeteer.
For each step of a defined scenario, I capture a screenshot, a waterfall and performance metrics.
My problem is with the waterfall: I previously used puppeteer-har, but this package is not able to capture requests outside of a navigation.
Therefore I use this piece of code to capture all interesting requests:
const {harFromMessages} = require('chrome-har');
// Event types to observe for waterfall saving (probably overkill, I just set all events of Page and Network)
const observe = [
'Page.domContentEventFired',
'Page.fileChooserOpened',
'Page.frameAttached',
'Page.frameDetached',
'Page.frameNavigated',
'Page.interstitialHidden',
'Page.interstitialShown',
'Page.javascriptDialogClosed',
'Page.javascriptDialogOpening',
'Page.lifecycleEvent',
'Page.loadEventFired',
'Page.windowOpen',
'Page.frameClearedScheduledNavigation',
'Page.frameScheduledNavigation',
'Page.compilationCacheProduced',
'Page.downloadProgress',
'Page.downloadWillBegin',
'Page.frameRequestedNavigation',
'Page.frameResized',
'Page.frameStartedLoading',
'Page.frameStoppedLoading',
'Page.navigatedWithinDocument',
'Page.screencastFrame',
'Page.screencastVisibilityChanged',
'Network.dataReceived',
'Network.eventSourceMessageReceived',
'Network.loadingFailed',
'Network.loadingFinished',
'Network.requestServedFromCache',
'Network.requestWillBeSent',
'Network.responseReceived',
'Network.webSocketClosed',
'Network.webSocketCreated',
'Network.webSocketFrameError',
'Network.webSocketFrameReceived',
'Network.webSocketFrameSent',
'Network.webSocketHandshakeResponseReceived',
'Network.webSocketWillSendHandshakeRequest',
'Network.requestWillBeSentExtraInfo',
'Network.resourceChangedPriority',
'Network.responseReceivedExtraInfo',
'Network.signedExchangeReceived',
'Network.requestIntercepted'
];
At the start of the step:
// list of events for converting to HAR
const events = [];
const client = await page.target().createCDPSession();
await client.send('Page.enable');
await client.send('Network.enable');
observe.forEach(method => {
  client.on(method, params => {
    events.push({ method, params });
  });
});
At the end of the step:
const waterfall = await harFromMessages(events);
It works well for navigation events, and also for navigation inside a web application.
However, the web application I am trying to monitor has iframes containing the main content.
I would like to see the iframes' requests in my waterfall.
So, a few questions:
Why doesn't Network.responseReceived (or any other event) capture these requests?
Is it possible to capture such requests?
So far I've read the DevTools protocol documentation and found nothing I could use.
The closest thing to my problem I found is this question:
How can I receive events for an embedded iframe using Chrome Devtools Protocol?
My guess is that I have to enable the Network domain for each iframe I encounter.
I didn't find any way to do this. If there is a way to do it with the DevTools protocol, I should have no problem implementing it with Node.js and puppeteer.
Thanks for your insights!
EDIT 18/08:
After more searching on the subject, mostly about out-of-process iframes, lots of people on the internet point to this response:
https://bugs.chromium.org/p/chromium/issues/detail?id=924937#c13
The answer in question states:
Note that the easiest workaround is the --disable-features flag.
That said, to work with out-of-process iframes over DevTools protocol,
you need to use Target [1] domain:
Call Target.setAutoAttach with flatten=true;
You'll receive Target.attachedToTarget event with a sessionId for the iframe;
Treat that session as a separate "page" in chrome-remote-interface. Send separate protocol messages with additional sessionId field:
{id: 3, sessionId: "", method: "Runtime.enable", params: {}}
You'll get responses and events with the same "sessionId" field which means they are coming from that frame. For example:
{sessionId: "", method: "Runtime.consoleAPICalled", params: {...}}
However I'm still not able to implement it.
I'm trying this, mostly based on puppeteer:
const events = [];
const targets = await browser.targets();
const nbTargets = targets.length;
for (var i = 0; i < nbTargets; i++) {
  console.log(targets[i].type());
  if (targets[i].type() === 'page') {
    const client = await targets[i].createCDPSession();
    await client.send("Target.setAutoAttach", {
      autoAttach: true,
      flatten: true,
      windowOpen: true,
      waitForDebuggerOnStart: false // is set to false in pptr
    });
    await client.send('Page.enable');
    await client.send('Network.enable');
    observe.forEach(method => {
      client.on(method, params => {
        events.push({ method, params });
      });
    });
  }
}
But I still don't get the expected output for navigation inside an iframe in the web application.
However, I am able to capture all the requests during the step where the iframe is loaded.
What I am missing are the requests that happen outside of a proper navigation.
Does anyone have an idea about integrating that chromium response above into puppeteer? Thanks!
I was looking on the wrong side all this time.
The chrome network events are correctly captured, as I would have noticed if I had checked the "events" variable earlier.
The problem comes from the "chrome-har" package that I use in:
const waterfall = await harFromMessages(events);
The package expects the page and iframe main events to be present in the same batch of events as the requests; otherwise the requests "can't be mapped to any page at the moment".
Since some steps of my scenario are navigations within the same web application (i.e. no navigation event), I didn't have these events, so chrome-har couldn't map the requests and therefore produced an empty .har.
Hope it can help someone else, I messed up the debugging on this one...
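For anyone hitting the same thing, a minimal sketch of one possible workaround, assuming the step boundaries are what drop the Page.* events (pageEvents and stepEvents are illustrative names, not from the original code): keep the lifecycle events from the last real navigation and prepend them to each step's batch before calling harFromMessages.
const pageEvents = []; // Page.* lifecycle events, kept across steps
let stepEvents = [];   // Network.* events captured during the current step

observe.forEach(method => {
  client.on(method, params => {
    const entry = { method, params };
    if (method.startsWith('Page.')) {
      pageEvents.push(entry);
    } else {
      stepEvents.push(entry);
    }
  });
});

// At the end of a step with no navigation, re-use the stored Page.* events so
// chrome-har can still map the requests to a page:
const waterfall = harFromMessages([...pageEvents, ...stepEvents]);
stepEvents = [];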
Goal: I'm trying to scrape pictures from instagram using Puppeteer to programmatically log in to my account and start mining data.
The issue: I can log in fine but then I get hit with a popup asking if I want notifications (I turned headless off to see this in action). I'm following the example code for this found here: https://github.com/checkly/puppeteer-examples/blob/master/3.%20login/instagram.js which uses the below try block to find the notification popup and click the 'Not now' button.
// check if the app asks for notifications
try {
  await loginPage.waitForSelector(".aOOlW.HoLwm", {
    timeout: 5000
  });
  await loginPage.click(".aOOlW.HoLwm");
} catch (err) {
  // the popup never appeared, carry on
}
The problem is it doesn't actually click the 'Not now' button so my script is stuck in limbo. The selector is pointing to the right div so what gives?
Can you please try enabling "notifications" using browserContext.overridePermissions?
You can override notifications. This is code that would, for example, suppress the Allow Notifications popup when logging into Facebook:
const puppeteer = require('puppeteer');

let crawl = async function () {
  let browser = await puppeteer.launch({ headless: false });
  const context = browser.defaultBrowserContext();
  // URL + an array of permissions to grant for that origin
  await context.overridePermissions("https://www.facebook.com", ["geolocation", "notifications"]);
  let page = await browser.newPage();
  await page.goto("https://www.facebook.com");
  await page.type("#email", process.argv[2]);
  await page.type("#pass", process.argv[3]);
  await page.click("#u_0_2");
  await page.waitFor(1000);
  await page.waitForSelector("#pagelet_composer");
  let content2 = await page.$$("#pagelet_composer"); // .$$ returns an array of ElementHandles; .$ would return a single ElementHandle
  console.log(content2);
};
crawl();
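Assuming the same approach carries over to the Instagram case from the question, the override would target Instagram's origin before navigating (sketch only; other dialogs may still appear):
const context = browser.defaultBrowserContext();
await context.overridePermissions("https://www.instagram.com", ["notifications"]);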