Puppeteer iterates through all pages but browser won't close after - node.js

After the program iterates through all pages, it doesn't break out of the while loop and close the browser. Instead, it runs through the while loop one extra time and gives me this error: "TimeoutError: waiting for selector `.pager .next a` failed: timeout 30000ms exceeded". What went wrong?
const puppeteer = require("puppeteer");
(async () => {
const browser = await puppeteer.launch({ headless: false });
const page = await browser.newPage();
await page.goto("http://books.toscrape.com/");
let isLastPage = false;
while (!isLastPage) {
await page.waitForSelector(".pager .next a");
isLastPage = (await page.$(".pager .next a")) === null;
await page.click(".pager .next a");
}
console.log("done");
await browser.close();
})();

Your last-page detection logic is flawed. On each page you both check whether .pager .next a exists AND try to click it. Obviously, if it doesn't exist, you can't click it (and the waitForSelector before it will time out instead of failing fast).
What you want to do is make sure the page is loaded by waiting for .pager .current, which is part of the navigation footer and present on every page. Then check whether .pager .next a exists BEFORE you click; if it's not there, break out of the while loop. If the page is dynamic and you need Puppeteer, you can do something like this:
const puppeteer = require("puppeteer");
(async () => {
const browser = await puppeteer.launch({ headless: false });
try {
const page = await browser.newPage();
await page.goto("http://books.toscrape.com/");
let cntr = 0;
while (true) {
await page.waitForSelector(".pager .current");
console.log(`page ${++cntr}`);
// process the page content here
if ((await page.$(".pager .next a")) === null) {
break;
}
await page.click(".pager .next a");
}
console.log("done");
} finally {
await browser.close();
}
})();
And, to make sure you always close the browser, even when an error is thrown, wrap the work in try/finally so the close call runs in every case.
If the page is not dynamic, you can instead issue plain GET requests and examine the markup with cheerio, which is simpler and doesn't involve loading the whole Chromium browser engine.
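For instance, if the listing markup is static, the extraction step needs no browser at all. A minimal sketch of that idea, with a hard-coded HTML fragment standing in for the body of a real GET response and a naive regex standing in where cheerio would normally be used:

```javascript
// Sketch: parsing static pagination markup without a browser.
// The fragment below stands in for the response body of a plain GET;
// in real code you would fetch it and parse it with cheerio instead.
const html = `
  <ul class="books">
    <li><h3><a title="A Light in the Attic">A Light...</a></h3></li>
    <li><h3><a title="Tipping the Velvet">Tipping...</a></h3></li>
  </ul>
  <ul class="pager"><li class="next"><a href="catalogue/page-2.html">next</a></li></ul>`;

// Pull out the book titles on this page.
const titles = [...html.matchAll(/<a title="([^"]+)"/g)].map(m => m[1]);

// Check whether a "next" link exists, and where it points.
// On the last page this is null -- the same stop condition as above.
const nextMatch = html.match(/class="next"><a href="([^"]+)"/);
const nextUrl = nextMatch ? nextMatch[1] : null;

console.log(titles);  // titles found on this page
console.log(nextUrl); // relative URL of the next page, or null when done
```

The loop then becomes: GET a URL, extract, and repeat while nextUrl is non-null.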

Related

How to tell Puppeteer to finish everything and give me what it has so far

I am struggling to get Puppeteer to just finish all activity so I can get the page content. It just fails silently on await page.content();.
This is for pages that often have few resources that never really finish loading, so the browser is stuck in page not finished loading state. (Chromium tab spinner is spinning).
So I am looking for a way to force the page to conclude the rendering process with whatever it has done so far. I assumed await client.send("Page.stopLoading"); would do that for me, but it's pretty much like pressing X in the browser; it doesn't tell Chromium to finish loading.
(async () => {
const browser = await p.launch({
defaultViewport: { width: 800, height: 600, isMobile: true },
});
const page = await browser.newPage();
const client = await page.target().createCDPSession();
await page.setUserAgent(ua);
try {
await page.goto(url, { waitUntil: "networkidle2", timeout: 5000 }).catch(e => void e);
await client.send("Page.stopLoading");
console.log("tried to stop");
const content = await page.content();
console.log(content);
} catch (e) {
console.log("fail");
}
await browser.close();
})();
I also tried building an array of all requests and looking for ones the interceptor has not handled, in order to abort hanging requests, but that approach produced nothing either.
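A different angle, sketched here with plain promises (withTimeout and the slowContent stub are illustrative names, not Puppeteer APIs): rather than telling Chromium to stop, stop waiting on it — race page.content() against your own timeout and keep whatever settles first.

```javascript
// Race a slow promise against a timeout and keep whatever settles first.
// Resolves with `fallback` instead of rejecting, so the caller always
// gets something back even when the page never finishes loading.
function withTimeout(promise, ms, fallback) {
  let timer;
  const timeout = new Promise(resolve => {
    timer = setTimeout(() => resolve(fallback), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Stub standing in for page.content() on a page that never stops loading.
// .unref() lets the process exit even though this timer is still pending.
const slowContent = new Promise(resolve =>
  setTimeout(() => resolve("full content"), 10_000).unref());

withTimeout(slowContent, 100, "partial content")
  .then(result => console.log(result)); // "partial content" after ~100 ms
```

In the real script you would race page.content() itself; the page keeps spinning, but your code moves on with the partial result.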

playwright - get content from multiple pages in parallel

I am trying to get the page content from multiple URLs using playwright in a nodejs application. My code looks like this:
const getContent = async (url: string): Promise<string> => {
const browser = await firefox.launch({ headless: true });
const page = await browser.newPage();
try {
await page.goto(url, {
waitUntil: 'domcontentloaded',
});
return await page.content();
} finally {
await page.close();
await browser.close();
}
}
const items = [
{
urls: ["https://www.google.com", "https://www.example.com"]
// other props
},
{
urls: ["https://www.google.com", "https://www.example.com"]
// other props
},
// more items...
]
await Promise.all(
items.map(async (item) => {
const contents = [];
for (const url of item.urls) {
contents.push(await getContent(url));
}
return contents;
})
);
I am getting errors like error (Page.content): Target closed. but I noticed that if I just run without loop:
const content = await getContent('https://www.example.com');
It works.
It looks like each iteration of the loop shares the same instance of browser and/or page, so they are closing or navigating away from each other's pages.
To test this I built a web API around the getContent function; when I send 2 requests at (almost) the same time, one of them fails, whereas if I send one request at a time it always works.
Is there a way to make playwright work in parallel?
I don't know if it solves your error, but I noticed there are two missing awaits: both firefox.launch(...) and browser.newPage() are asynchronous and need an await in front.
Also, you don't need to launch a new browser so many times. Playwright has isolated browser contexts, which are created much faster than launching a browser. It's worth experimenting with launching the browser once, before getContent, and using
const context = await browser.newContext();
const page = await context.newPage();
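It also helps to cap how many pages are in flight at once. A small promise-pool sketch (mapWithLimit is an illustrative helper, not a Playwright API); in real use, worker would open a fresh context via browser.newContext(), scrape, and close it:

```javascript
// Run an async worker over a list with at most `limit` calls in flight.
// Results come back in input order regardless of completion order.
async function mapWithLimit(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;
  async function run() {
    while (next < items.length) {
      const i = next++; // claim an index synchronously, before awaiting
      results[i] = await worker(items[i], i);
    }
  }
  const runners = Array.from({ length: Math.min(limit, items.length) }, run);
  await Promise.all(runners);
  return results;
}

// Demo with a trivial async worker:
mapWithLimit([1, 2, 3, 4], 2, async n => n * 10)
  .then(out => console.log(out)); // [ 10, 20, 30, 40 ]
```

For example, `const contents = await mapWithLimit(item.urls, 4, getContent);` keeps at most four pages open per item while preserving result order.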

Puppeteer: Staying on page too long closes the browser (Protocol error (Runtime.callFunctionOn): Session closed. Most likely the page has been closed)

My plan is to connect to a page, interact with its elements for a while, and then wait and start over. Since the process of accessing the page is complicated, I would ideally log in only once, and then permanently stay on page.
index.js
const puppeteer = require('puppeteer');
const creds = require("./creds.json");
(async () => {
try {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://www.online-messenger.com');
await goToChats(page);
await page.waitForSelector('div[aria-label="Chats"]');
setInterval(async () => {
let index = 1;
while (index <= 5) {
if (await isUnread(page, index)) {
await page.click(`#local-chat`);
await page.waitForSelector('div[role="main"]');
let conversationName = await getConversationName(page);
if (isChat(conversationName)) {
await writeMessage(page);
}
}
index++;
}
}, 30000);
} catch (e) { console.log(e); }
await page.close();
await browser.close();
})();
Again, I do not want to close the connection, so I thought adding setInterval() would help me with the problem. The core code works absolutely fine, but every single time I run it with the interval function I get this error:
Error: Protocol error (Runtime.callFunctionOn): Session closed. Most likely the page has been closed.
I timed the main part of my code and it would typically take around 20-25 seconds. I thought the problem lies in the delay set to 30 seconds, but I get the same error even when I increase it to e.g. 60000 (60 seconds).
What am I doing wrong? Why is setInterval not working, and is there possibly a different way of tackling my problem?
Okay, after spending some more time on the problem, I realised it was indeed the setInterval function that caused all the errors.
The code is asynchronous, and to make it all work I had to use an 'async' version of setInterval(): I wrapped my code in an endless loop that ends with a promise which resolves after a specified time.
...
await goToChats(page);
await page.waitForSelector('div[aria-label="Chats"]');
while(true) {
let index = 1;
while (index <= 5) {
if (await isUnread(page, index)) {
await page.click(`#local-chat`);
await page.waitForSelector('div[role="main"]');
let conversationName = await getConversationName(page);
if (isChat(conversationName)) {
await writeMessage(page);
}
}
await waitBeforeNextIteration(10000);
index++;
}
...
function waitBeforeNextIteration(ms) {
return new Promise(resolve => setTimeout(resolve, ms))
}
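Put together, the pattern looks like the sketch below, with the page work stubbed out (checkChats stands in for the real isUnread/click/waitForSelector calls) and a short delay so it finishes quickly:

```javascript
// Promise-based sleep: the building block for an "async setInterval".
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

// Stub standing in for one full round of page interaction.
async function checkChats(round) {
  console.log(`checking chats, round ${round}`);
}

(async () => {
  // Unlike setInterval, each iteration fully finishes (every await has
  // settled) before the delay for the next one even starts, so a slow
  // round can never overlap with the next one.
  for (let round = 1; round <= 3; round++) {
    await checkChats(round);
    await sleep(50); // 30000 in the real script
  }
  console.log("done");
})();
```

This is exactly why the Session closed error disappears: there is never a second callback racing against a page that the first one is still using.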

Puppeteer Devtools Programmatically

I can open the devtools in Puppeteer, but I cannot write data to the console panel or export its log to the cmd screen. How can I do that?
In Puppeteer, I want to print to the console as below and get the output below.
Screenshot
You are asking for two things here:
Capture console.log messages to the command prompt
Run a JavaScript command inside Puppeteer
For the first point you can pass dumpio: true as a launch option
For the second point you can jump into the page using evaluate and make a call to console.log there
const puppeteer = require('puppeteer');
(async () => {
const browser = await puppeteer.launch({
dumpio: true
});
const page = await browser.newPage();
const url = "https://stackoverflow.com";
await page.goto(url);
await page.waitForSelector('h1');
await page.evaluate(() => {
console.log(document.getElementsByTagName("h1")[0].innerText);
});
console.log("Done.")
await browser.close();
})();
Also, for brevity, if you are getting too much output you can omit dumpio and instead catch the log as an event, e.g.
page.on('console', (msg) => console.log('PAGE LOG:', msg.text()));
await page.waitForSelector('h1');
await page.evaluate(() => {
console.log(1 + 2);
console.log(document.getElementsByTagName("h1")[0].innerText);
});
the second script returns
PAGE LOG: 3
PAGE LOG: We <3 people who code
Done.

Web Automation to go to the next page - Invoking a method that returns a Promise within an await block - await is only valid in async function

Background:
I am writing a Nodejs script with puppeteer to web scrape data from a web page. I'm not familiar with Nodejs, promises, or puppeteer. I've tried many things and done research for a few days.
Application Flow:
With automation, go to a website
Scrape data from the page, push to an array
If there is a "next page" click the next page button
Scrape data from the page, push to same array
Repeat
Problem:
My problem is with #3. With web automation, clicking the next page button.
All I want is to use the .click() method in Puppeteer to click on the button's selector. However, .click() returns a Promise, so I need the await keyword, but you can't use await in the for loop (or in any function that isn't async).
What Have I Tried:
I've tried creating another async function containing await page.click(); and calling it in the problem area. I've tried creating a regular function with page.click() and calling that in the problem area. I've also tried refactoring everything, to no avail. I'm not really understanding Promises and async/await even after reading about them for a few days.
What I Want Help With:
Help with invoking the .click() method inside the problem area or any help with selecting the 'Next Page' using web automation.
Pseudo Code:
let scrape = async () => {
await //do.some.automation;
const result = await page.evaluate(() => {
for (looping each page) {
if (there is a next page) {
for (loop through data) {
array.push(data);
//----PROBLEM----
//use automation to click the selector of the next page button
//--------------
}
}
}
return data;
});
//close browser
return result;
};
scrape().then((value) => {
//output data here;
});
All Code:
let scrape = async () => {
const browser = await puppeteer.launch({
headless: false
});
const page = await browser.newPage();
await page.goto("GO TO A WEBSITE");
await page.click("CLICK A BUTTON");
await page.waitFor(2000);
//Scraping
const result = await page.evaluate(() => {
let pages = document.getElementsByClassName("results-paging")[2];
let allPages = pages.getElementsByClassName("pagerLink");
let allJobs = [];
//Loop through each page
for (var j = 0; j < allPages.length; j++) {
let eachPage = pages.getElementsByClassName("pagerLink")[j].innerHTML;
if (eachPage) {
//Scrape jobs on single page
let listSection = document.getElementsByTagName("ul")[2];
let allList = listSection.getElementsByTagName("li");
for (var i = 0; i < allList.length; i++) {
let eachList = listSection.getElementsByTagName("li")[i].innerText;
allJobs.push(eachList);
//--------PROBLEM-------------
await page.click('#selector_of_next_page');
//----------------------------
}
}
else {
window.alert("Fail");
}
}
return allJobs;
});
browser.close();
return result;
};
scrape().then((value) => {
let data = value.join("\r\n");
console.log(data);
fs.writeFile("RESULTS.txt", data, function (err) {
console.log("SUCCESS MESSAGE");
});
});
Error Message:
SyntaxError: await is only valid in async function
You cannot use page methods inside the page.evaluate function.
Based on your example you should change
await page.click('#selector_of_next_page');
to the native JS equivalent
document.querySelector('#selector_of_next_page').click();
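The broader rule: keep DOM reads inside evaluate and drive navigation from Node. Sketched below with a stubbed page object (fakePages, makeStubPage, and the stub's methods are all illustrative stand-ins for real Puppeteer calls):

```javascript
// Sketch of the scrape loop with page methods kept outside evaluate().
const fakePages = [["job A", "job B"], ["job C"]]; // two "pages" of jobs

function makeStubPage() {
  let index = 0;
  return {
    // Stands in for page.evaluate(() => [...scrape list items...])
    evaluate: async () => fakePages[index],
    // Stands in for page.$("#selector_of_next_page") -- null on last page
    hasNext: async () => index < fakePages.length - 1,
    // Stands in for page.click(...) plus page.waitForNavigation()
    clickNext: async () => { index++; },
  };
}

async function scrape(page) {
  const allJobs = [];
  while (true) {
    allJobs.push(...await page.evaluate()); // DOM reads happen in the page
    if (!(await page.hasNext())) break;     // navigation decisions in Node
    await page.clickNext();
  }
  return allJobs;
}

scrape(makeStubPage()).then(jobs => console.log(jobs));
// → [ 'job A', 'job B', 'job C' ]
```

Swapping the stub for a real page gives the structure the question's all-in-one evaluate block was reaching for, with await legal everywhere it appears.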
